Daniel and I have been looking more closely at the advertisements from Arkansas and Mississippi digitized in the Documenting Runaway Slaves project. Using regular expressions in TextWrangler, we are cleaning up the text files to remove unwanted information, such as footnotes, extra dates, and page numbers. Our goal is to find out how many ads there are in total for each state, how many appear in each newspaper, and how many fall between 1835 and 1860. Below is our progress, divided by state.
Arkansas Advertisements – Daniel
By using regular expressions to find each ad's date and split the ads into individual text files, we identified 457 separate ads for Arkansas. Searching the years within that set narrowed the pool to 324 ads from 1835-1860.
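For anyone who would rather script this step than run it in the editor, here is a minimal Python sketch of the same idea. It is not the search we actually ran: the sample text is invented, and it assumes each ad opens with a date alone on its own line in day-month-year order, which may not hold everywhere in the real file.

```python
import re

# Invented sample; the real file layout may differ.
SAMPLE = """\
12 March 1834
Ranaway from the subscriber ...
3 June 1836
Committed to the jail ...
21 October 1861
Ranaway ...
"""

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Assumed ad header: day, month name, four-digit year, alone on a line.
DATE_LINE = re.compile(rf"^(\d{{1,2}}) ({MONTHS}) (\d{{4}})$", re.MULTILINE)


def split_ads(text):
    """Return (year, ad_text) pairs, one per date line found."""
    matches = list(DATE_LINE.finditer(text))
    ads = []
    for i, m in enumerate(matches):
        # Each ad runs from its date line to the next date line (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        ads.append((int(m.group(3)), text[m.start():end].strip()))
    return ads


ads = split_ads(SAMPLE)
in_range = [ad for ad in ads if 1835 <= ad[0] <= 1860]
print(len(ads), "ads in total,", len(in_range), "from 1835-1860")
```

Each pair in `ads` could then be written out to its own text file, which is what we did by hand.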
After uploading the text to Voyant Tools, I used the RezoViz tool to identify the different organizations in the ads. This gave a strong pointer towards which newspaper titles occur most frequently in the ads. Searching for these titles in TextWrangler, I was then able to count the occurrences with the “Find All” feature. This search found 272 occurrences of the Arkansas Gazette. Of these, 28 were overcounts caused by mentions in footnotes (which we were unable to remove from the PDF). Subtracting them left an adjusted count of 244 runaway ads in the Arkansas Gazette from 1835-1860. Similar searches showed the runners-up to be the Washington Telegraph, with 35 ads during this period, and the Arkansas Advocate, with 31.
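The counting and the footnote adjustment can also be sketched in a few lines of Python. This is only an illustration: the sample text is made up, the helper names are hypothetical, and the footnote overcount is supplied as a hand tally (as it was in our case) rather than detected automatically.

```python
from collections import Counter

TITLES = ["Arkansas Gazette", "Washington Telegraph", "Arkansas Advocate"]

# Invented sample: two ad headers plus one footnote that mentions a title.
SAMPLE = (
    "Arkansas Gazette, 12 March 1836\nRanaway ...\n"
    "Washington Telegraph, 4 May 1850\nRanaway ...\n"
    "1 See the Arkansas Gazette for a later notice.\n"
)


def count_titles(text, titles=TITLES):
    """Count plain-text occurrences of each title, like a Find All search."""
    return Counter({title: text.count(title) for title in titles})


def adjust(raw, overcounts):
    """Subtract hand-tallied footnote mentions from the raw counts."""
    return {title: raw[title] - overcounts.get(title, 0) for title in raw}


raw = count_titles(SAMPLE)
print(adjust(raw, {"Arkansas Gazette": 1}))

# The same subtraction with the figures reported above.
print(adjust(Counter({"Arkansas Gazette": 272}), {"Arkansas Gazette": 28}))
```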
Mississippi Advertisements – Kaitlyn
First, I removed the extra date headers by using regular expression #1, posted as a gist on my GitHub account. Then, I removed the page numbers by using regular expression #2. That’s when I started seeing some issues in how the text copied over from the PDF file I downloaded from the Documenting Runaway Slaves project. As shown in the picture below, I discovered that every time a superscript (such as th, st, or nd) is used, the text does not copy over in the correct order.
As you can see on line 342, the text abruptly cuts off right where the th superscript should be, and the rest of the text that follows is now placed on line 351. The superscript has been placed on line 341 (or line 347; both contain “th”). Superscripts were not used consistently for numbers throughout the document, so the problem does not affect all of the advertisements. It will pose more of a problem, though, once we start using the advertisements for analysis.
One other problem I discovered is that the date lines in the [date Month year] format are not consistent: some end in a period, some do not, and some carry bracketed editorial information. Therefore, I had to use regular expression #3 to figure out how many advertisements the document contained. I found 1633 matches, roughly three and a half times as many as the 457 we found for Arkansas. I additionally used regular expression #4 to figure out how many advertisements we had from the period of 1835-1860, and I discovered 1060 matches. There may well have been a more effective way to do this, but I think I was able to find them all using that expression.
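To show the kind of pattern this calls for, here is a hedged Python sketch. It is not regular expression #3 or #4 from the gist: the sample lines are invented, and it assumes a date line sits alone on its line, optionally followed by a bracketed note and an optional period. The year alternation restricts matches to 1835-1860 using only regex syntax, so the same idea should carry over to an editor's search box.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# 1835-1839, 1840-1859, or 1860.
YEAR_IN_RANGE = r"18(?:3[5-9]|[45]\d|60)"

# Optional bracketed note, optional period, optional trailing spaces.
TAIL = r"(?:[ \t]*\[[^\]\n]*\])?\.?[ \t]*$"

ANY_DATE = re.compile(
    rf"^\d{{1,2}} (?:{MONTHS}) \d{{4}}{TAIL}", re.MULTILINE)
RANGE_DATE = re.compile(
    rf"^\d{{1,2}} (?:{MONTHS}) {YEAR_IN_RANGE}{TAIL}", re.MULTILINE)

# Invented sample covering the three variants of a date line.
SAMPLE = """\
4 July 1834
Ranaway from the subscriber ...
12 March 1836.
Committed to the jail on 12 March 1836 ...
3 June 1840 [illegible]
Ranaway ...
21 October 1861.
Ranaway ...
"""

total = len(ANY_DATE.findall(SAMPLE))
in_range = len(RANGE_DATE.findall(SAMPLE))
print(total, "date lines,", in_range, "from 1835-1860")
```

A date mentioned in the middle of an ad's text is skipped because the pattern is anchored to the start of the line.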
I am still working on figuring out how to remove all of the footnotes. The footnotes seem to share nothing except a number at the beginning of the line, so it is difficult to remove them without removing advertisement information as well. Additionally, I will use RezoViz to see how many advertisements we have from each newspaper, as Daniel did with the Arkansas ads. Because there are too many Mississippi ads to analyze all at once in Voyant, this task is taking longer than I originally thought it would.