Please view our final project website, Digital History Methods, to see what we produced in the Spring 2014 semester!
Author Archives: Caleb McDaniel
To complete your work for this class, you need to do three things:
Over the weekend, resolve any issues that have been filed in our GitHub repository for the page(s) you are responsible for on the final website. (UPDATE: Please also comment on this important new issue regarding a statement on our front page about the nature of our sources.) We will consider our final project to be in its finished form at 11:59pm this Sunday.
After the project has been finalized, write a report of approximately 950 to 1,250 words. Your report should assess how well our website meets its objective of demonstrating the possibilities and limits of digital history methods to an audience of historians and scholars interested in digital humanities but new to the field. Be specific about the things that you think work well to meet this objective, and the things that would most need work or expansion in future iterations. At the end of your essay, give the final project a score of up to 30 points (with 30 being the best possible score) based on your assessment of its quality. Your final report grade will be an average of your own score together with a score of up to 30 points that I will assign your report based on how well, and with how much specific evidence, you make the case for your assessment.
Respond to the final "team participation" questionnaire of the semester, which I will email to you all individually.
Thanks for a great semester, and let me know if you have any questions!
I’ve put the to-do list that we made in class yesterday in the README file for our Drafts repo. For the next week, do all of your work directly on the GitHub files. Remember to "commit" your changes so you don’t lose work! I recommend drafting your work in a text editor and saving copies on your machine, just in case.
Note that if someone else changes the file while you are editing it, you should see a red banner at the top prompting you to review the other person’s changes. Click the "review changes" link in that banner, and the new text should appear in your editing window alongside yours. You can then commit the file with both sets of changes incorporated.
Some of you expressed an interest in being able to quickly count all the ads in a folder and determine how many were published in a given year, decade, or month (to detect seasonal patterns across the year).
Here is a script that can do that. It is designed to work on Mac or Linux systems.
To use it, you should first download our adparsers repo by clicking on the "Download Zip" button on its GitHub page.
Unzip the downloaded file, and you should then have a directory that contains (among other things) the countads.sh script.
You should now copy countads.sh to the directory that contains the ads you want to count. You can do this by drag-and-drop, or you can use your terminal and the cp command. (If you forgot what that command does, revisit the Command Line bootcamp that was part of the MALLET homework.) Once the script is in the directory, navigate to that directory in your terminal, and then run it like this:
./countads.sh
If you get an error message, you may need to follow the instructions in the comments at the start of the script (which you can read on GitHub) to change the permissions. But if all goes well, you’ll see a printed breakdown of chronological counts. For example, when I run the script in the directory containing all our Mississippi ads, the script returns this:
TOTAL 1632

DEC ADS
1830s 1118
1840s 178
1850s 133
1860s 4

YEAR ADS
1830 30
1831 54
1832 87
1833 68
1834 143
1835 157
1836 262
1837 226
1838 63
1839 28
1840 16
1841 16
1842 22
1843 33
1844 44
1845 25
1846 14
1847 1
1848 5
1849 2
1850 11
1851 17
1852 19
1853 15
1854 7
1855 9
1856 11
1857 23
1858 13
1859 8
1860 4

MONTH ADS
1 100
2 89
3 103
4 130
5 160
6 161
7 188
8 150
9 150
10 149
11 146
12 86
If you choose, you can also "redirect" this output to a file, like this:
./countads.sh > filename.txt
Now you should be able to open filename.txt (which you can name whatever you want) in Microsoft Excel, and you’ll have a spreadsheet with all the numbers.
The script may seem limited on its own, but its real utility comes from first getting an interesting set of ads into a directory. For example, if you wanted to know only the month distribution of ads in a particular year, you could first move all the ads from that year into a directory and run the script from within it. You’d get zeroes for all the years you’re not interested in, but you would get the month breakdown you are interested in. Depending on which ads you put in the directory you are counting, you can get a lot of useful data that can then be graphed or fed into further calculations.
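If you’re curious how this sort of counting works (or want to adapt it), the gist can be sketched in Python. One big assumption here: purely for illustration, I’m pretending each ad’s filename begins with an eight-digit date like 18360215; our real filenames follow the drsparser naming scheme, so the pattern would need adjusting to match.

```python
import os
import re
from collections import Counter

def count_ads(directory):
    """Count .txt ads by decade, year, and month, assuming
    (hypothetically) that each filename starts with YYYYMMDD."""
    years, months = Counter(), Counter()
    total = 0
    for name in os.listdir(directory):
        m = re.match(r'(\d{4})(\d{2})\d{2}.*\.txt$', name)
        if not m:
            continue  # skip files that don't look like dated ads
        total += 1
        years[int(m.group(1))] += 1
        months[int(m.group(2))] += 1
    # roll the year counts up into decades (1836 -> 1830, etc.)
    decades = Counter()
    for year, n in years.items():
        decades[year // 10 * 10] += n
    return total, decades, years, months
```

Running it in a directory of ads returns the same kinds of totals the shell script prints, as Python objects you can graph or calculate with directly.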
As your groups have begun drafting essays for our final product, some of you have asked me how to recompose the permalink to a Texas ad using the information in the ad’s txt filename. Here’s a quick tutorial.
You may have noticed from my posts on Twitter that today is Day of DH 2014. To make a long story short, on #DayofDH , digital humanities scholars and teachers create special blogs to document their work for that day and to connect with like-minded scholars. Check it out if you want to learn more about the DH field writ large!
For my blog, I wrote a little bit about our Twitter bot, and particularly shared how I have now set up my computer to tweet an ad automatically every morning. As I mentioned in class yesterday, we now have around 70 followers of the Twitter account, with a couple more adding each day. Exciting times!
Now that our basic idea for the Twitter bot is up and running, perhaps we can also talk about whether there is anything else we want to add to it.
One potential limitation of our current setup is that only those who have followed us are likely to see our tweets (except when one of our followers retweets an ad, which hasn’t really happened yet). But one of our stated goals in the essay was to "surprise" people by showing them an ad in a context where they don’t expect it. We will still accomplish that with our followers, but their "surprise" will be lessened by the fact that they chose to follow our account. Any ideas about how we can increase the distribution of, and audience for, the tweets, particularly among non-followers?
Another idea that Alyssa brought up in class was to add to our account some regular "on this day" tweets. If you have ideas about how such tweets should be worded, please share them in the comments. There may be some way to word these OTD tweets in a way that solves the problem above. Open to your suggestions!
You may have noticed that I was able to put a pretty clean ZIP file of Arkansas ads into our private repository. As you know, we’ve had some difficulties copying and pasting text from the wonderful PDFs posted by the Documenting Runaway Slaves project: namely, copying and pasting from the PDF into a text file results in footnotes and page numbers being mixed in with the text. Funny things also happen when there are superscript characters. This makes it difficult for us to do the kinds of text mining and Named Entity Recognition that we’re most interested in. But in this post I’ll quickly share how I dealt with these difficulties.
The key first step was provided by this tutorial on using the Automator program bundled with most Mac computers to extract Rich Text from PDFs. The workflow I created looked like this:
Extracting the text as "Rich Text" was the key. Running this workflow put an RTF file on my desktop that I then opened in Microsoft Word, which (I must now grudgingly admit) has some very useful features for a job like this. When I opened the file, for example, I noticed that all of the footnote text was a certain font size. I then used Word’s find and replace formatted text function to find and eliminate all text of that font size.
I used a similar technique to get rid of all the footnote reference numbers in the text, but in this case I had to be more specific because some of the text I wanted to preserve (like superscript "th," "st," and "nd" for ordinal numbers like "4th," "1st," and "2nd") was the same font size as the footnote markers. So I used Word’s native version of regular expressions (called wildcards) to find only numbers of that font size. In other words, the "Advanced Find and Replace" dialogue I used looked like this:
I used the same technique to eliminate the reference numbers left over from the eliminated footnotes, which were all of an even smaller font size. Similar adjustments can be made by noticing that many of the ordinal suffixes mentioned earlier ("th," "st," and "nd") are "raised" or "lowered" by a certain number of points. You can see this by selecting those abbreviations and then opening the Font window in Word. Clicking on the "Advanced" tab will reveal whether the text has been lowered or raised. An advanced find and replace, swapping all text raised or lowered by a specific number of points for text that is not raised or lowered, fixed some, though not all, of these problems.
At this point I reached the limit of what I could do with the formatting find and replace features in Word, so I saved my document as a Plain Text file (with the UTF-8 encoding option checked to make things easier later for our Python parsing script), and then opened it up in a text editor. At this point I noticed that there were still some problems (though not as many!) in the text:
The main problem seems to arise in cases where there was a superscript ordinal suffix in the first line of an ad. As you can see, the "th" ends up getting booted up to the first line, and the remainder of the line gets booted down to the bottom of the page. Fortunately, there seems to be some pattern to this madness, a pattern susceptible to regular expressions. I also noticed that the orphaned line fragments following ordinals seem to always be moved to the bottom of the "page" right before the page number (in this case "16"). This made it possible to do a regex search for any lines ending in "th" (or "st" or "nd") followed by another line ending in a number, followed by a replacement that moves the suffix to where it should be. Though it took a while to manually confirm each of these replacements (I was worried about inadvertently destroying text), it wasn’t too hard to do.
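To make the idea concrete, here’s a rough Python version of the kind of search-and-replace I did: find a line whose trailing word is a stray ordinal suffix, followed by a line ending in a digit, and move the suffix down after that digit. The exact shape of the garbling varies from page to page, so treat this as illustrative rather than a drop-in fix, and confirm each replacement by hand, as I did:

```python
import re

# A line ending in a stray "th"/"st"/"nd"/"rd", followed by a line
# ending in a digit: capture both so the suffix can be moved down.
ORDINAL_FIX = re.compile(r'(?m)^(.*\S)[ \t]+(th|st|nd|rd)[ \t]*\n(.*\d)[ \t]*$')

def reattach_ordinals(text):
    # \1 = first line without the suffix, \3 = line ending in the
    # number, \2 = the suffix glued back onto that number
    return ORDINAL_FIX.sub(r'\1\n\3\2', text)
```

For example, reattach_ordinals("committed to jail on the th\nnight of June 4\n...") moves the "th" down to produce "night of June 4th".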
A second regex search for page numbers allowed me to find all of the orphan fragments and manually move them to the lines where they should be (checking the master file from DRS in cases where it wasn’t clear which ad each fragment went with). The final step (which we already learned how to do in class) was to use a regular expression to remove all the year headers and page numbers from the file, as well as any blank lines. Franco’s drsparser script did the rest of the work of bursting the text file into individual ads and named the files using the provided metadata.
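For anyone who wants to script that last cleanup step instead of doing it in a text editor, here is a hedged Python sketch. The assumption that page numbers are one to three digits and year headers are bare four-digit years is mine, based on eyeballing the file, not a guarantee about the DRS layout:

```python
import re

def strip_headers(text):
    """Remove bare page-number lines, bare year-header lines, and
    blank lines from an extracted DRS text (heuristic widths assumed)."""
    # lines containing only a 1-3 digit page number
    text = re.sub(r'(?m)^[ \t]*\d{1,3}[ \t]*$\n?', '', text)
    # lines containing only a bare four-digit year header (17xx/18xx)
    text = re.sub(r'(?m)^[ \t]*1[78]\d\d[ \t]*$\n?', '', text)
    # collapse any remaining blank lines
    text = re.sub(r'\n[ \t]*\n+', '\n', text)
    return text
```

Because the page-number pattern is anchored to whole lines, it won’t eat numbers inside an ad’s text, but it’s still worth diffing the output against the original before trusting it.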
Daniel’s question in class on Monday, about whether we were planning to release the ads we have found to the public, reminded me that we had earlier discussed the possibility of tweeting out our transcriptions with a link to the zoomed image in the Portal to Texas History.
This tutorial suggests that may not be too difficult, especially now that we have a way to get all of our transcriptions out of our spreadsheets and into text files. It would be possible to write a script that reconstructs the URL to the page image from the title of our text files, and then tweets the first several words of the transcription with a link. That could be a way both of sharing our finds and of increasing interest in our larger project.
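As a proof of concept, here is roughly what such a script might look like in Python. The filename pattern ("metapth80775_2.txt") and the URL template are assumptions on my part about how our files and the Portal’s permalinks might line up; both would need to be checked against our actual naming scheme before use:

```python
import re

def permalink(filename):
    """Rebuild a (hypothetical) Portal to Texas History page URL from
    an ad's text filename, assumed to be ARK item id + page number."""
    m = re.match(r'(metapth\d+)_(\d+)\.txt$', filename)
    if m is None:
        return None
    return "https://texashistory.unt.edu/ark:/67531/%s/m1/%s/" % (m.group(1), m.group(2))

def compose_tweet(text, link, limit=140):
    """Lead with the opening words of the transcription, then the link,
    reserving roughly 24 characters for a t.co-shortened URL."""
    budget = limit - 24
    if len(text) <= budget:
        body = text
    else:
        # cut at a word boundary and mark the truncation
        body = text[:budget].rsplit(" ", 1)[0] + "…"
    return body + " " + link
```

A wrapper around this could walk our transcription files and queue one tweet per morning, the way the current bot does.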
Is this something we would still be interested in doing? Thoughts?
As discussed yesterday in class, we are going to split up into new teams to begin working on our final web project.
Both teams will contribute three things to the final project:
- A webpage that contains an introduction to the question, a step-by-step section discussing different methods you tried to answer the question, and a summary of findings and questions for future research.
- At least one non-narrative visualization illustrating some of the team’s findings.
- A brief, traditional historical essay that answers the team’s question using a close reading of the available sources. These essays will be combined on one page, separate from the digital methods reports.
Your team should not only try the methods identified below to answer the question, but also reflect and report on whether these methods actually help us answer the question posed. You should also not take for granted that the corpora we have are already suitably prepared for the methods you want to try; your team may need to think through how to turn the transcriptions and metadata we’ve collected into datasets that are actually susceptible to the kinds of analysis you want to try.
Alyssa and Daniel
How similar were Texas ads to ads from the nearby states of Mississippi and Arkansas?
Use text mining methods, such as word trends (in Voyant), TF-IDF, and topic modeling, to compare corpora from Texas, Mississippi and Arkansas.
Aaron, Clare, and Kaitlyn
Judging from runaway ads, how were Texas, Mississippi, Arkansas, and Louisiana connected geographically in the antebellum period?
Use Named Entity Recognition to extract place names from the ad corpora and then try different methods (place count tables and graphs, Google maps, network graphs) to visualize how often places from one state were mentioned in another state’s ads.
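For the text-mining team, here is a toy illustration of what TF-IDF buys us: terms that appear in all three corpora score zero, while terms distinctive to one state’s ads float to the top. This treats each state’s corpus as a single "document"; real runs would tokenize the full ad files and likely use MALLET or a similar toolkit rather than this hand-rolled version:

```python
import math
from collections import Counter

def tfidf_by_state(corpora):
    """corpora maps a state label to a list of tokens drawn from that
    state's ads; returns per-state tf-idf scores for every term."""
    n_docs = len(corpora)
    # document frequency: in how many state corpora does a term appear?
    df = Counter()
    for tokens in corpora.values():
        for term in set(tokens):
            df[term] += 1
    scores = {}
    for state, tokens in corpora.items():
        tf = Counter(tokens)
        total = len(tokens)
        scores[state] = {term: (count / total) * math.log(n_docs / df[term])
                         for term, count in tf.items()}
    return scores
```

With only three corpora, any term shared by Texas, Mississippi, and Arkansas gets idf = log(3/3) = 0, which is exactly the behavior we want when hunting for what makes each state’s ads distinctive.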
Deadlines for Progress Reports
All progress reports from here on out will be due before class begins.
- March 31: Report should include schedule for team’s tasks and work and initial delegation of tasks among team members
- April 7: Report should include a draft of the "close reading" essay required for the final project, as well as update on other tasks
- April 14 and 21: Report on progress toward final webpage
Got questions? Leave comments!