Category Archives: Tutorials

Script for Counting Ads

Some of you expressed an interest in being able to quickly count all the ads in a folder and determine how many were published in a given year, decade, or month (to detect seasonal patterns across the year).

Here is a script that can do that. It is designed to work on Mac or Linux systems.
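(If you’re curious how a count like this works, the core idea can be sketched in a couple of lines of shell. The sketch below assumes, hypothetically, that each ad’s filename begins with a date in YYYYMMDD form; the real countads.sh may go about it differently.)

# Hypothetical sketch: tally ads per year, assuming filenames start with YYYYMMDD
ls *.txt | cut -c1-4 | sort | uniq -c

# The same idea gives months (characters 5 and 6 of each filename)
ls *.txt | cut -c5-6 | sort | uniq -c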

To use it, you should first download our adparsers repo by clicking on the "Download Zip" button on this page:

Download the adparsers repo as a zip

Unzip the downloaded file, and you should then have a directory that contains (among other things) the countads.sh script.

You should now copy the script into the directory that contains the ads you want to count. You can do this the drag-and-drop way, or you can use your terminal and the cp command. (If you forgot what that command does, revisit the Command Line bootcamp that was part of the MALLET homework.)
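For example, with hypothetical paths (substitute your own):

cp ~/Downloads/adparsers-master/countads.sh ~/ads/mississippi/
cd ~/ads/mississippi/

Once the script is in the directory and you’ve navigated there in your terminal, run the command like this: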

 ./countads.sh

If you get an error message, you may need to follow the instructions in the comments at the start of the script (which you can read on GitHub) to change the permissions.
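Usually that just means making the script executable:

chmod +x countads.sh

But if all goes well, you’ll see a printed breakdown of chronological counts. For example, when I run the script in the directory containing all our Mississippi ads, the script returns this: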

TOTAL   1632 
     
DEC     ADS
1830s   1118
1840s   178
1850s   133
1860s   4
     
YEAR    ADS
1830    30
1831    54
1832    87
1833    68
1834    143
1835    157
1836    262
1837    226
1838    63
1839    28
1840    16
1841    16
1842    22
1843    33
1844    44
1845    25
1846    14
1847    1
1848    5
1849    2
1850    11
1851    17
1852    19
1853    15
1854    7
1855    9
1856    11
1857    23
1858    13
1859    8
1860    4
     
MONTH   ADS
1   100
2   89
3   103
4   130
5   160
6   161
7   188
8   150
9   150
10  149
11  146
12  86

If you choose, you can also "redirect" this output to a file, like this:

./countads.sh > filename.txt

Now you should be able to open filename.txt (which you can name whatever you want) in Microsoft Excel, and you’ll have a spreadsheet with all the numbers.

The script may seem limited, but its real power comes from first gathering an interesting set of ads into a directory. For example, if you wanted to know only the month distribution of ads in a particular year, you could move all the ads from that year into a directory and run the script from within it. You’d get lots of zeroes for all the years you’re not interested in, but you would also get the month breakdown you are interested in. Depending on which ads you put in the directory being counted, you can generate a lot of useful data that can then be graphed or fed into further calculations.
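For instance, supposing (hypothetically) that each ad’s filename begins with its year, you could isolate and count the 1836 ads like this:

mkdir 1836-only
cp 1836* 1836-only/
cp countads.sh 1836-only/
cd 1836-only
./countads.sh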

Reassembling URLs from Files

As your groups have begun drafting essays for our final product, some of you have asked me how to recompose the permalink to a Texas ad using the information in the ad’s txt filename. Here’s a quick tutorial.

Getting Ads from PDFs

You may have noticed that I was able to put a pretty clean ZIP file of Arkansas ads into our private repository. As you know, we’ve had some difficulties copying and pasting text from the wonderful PDFs posted by the Documenting Runaway Slaves project: namely, copying and pasting from the PDF into a text file results in footnotes and page numbers being mixed in with the text. Funny things also happen when there are superscript characters. This makes it difficult for us to do the kinds of text mining and Named Entity Recognition that we’re most interested in. But in this post I’ll quickly share how I dealt with these difficulties.

The key first step was provided by this tutorial on using the Automator program bundled with most Mac computers to extract Rich Text from PDFs. The workflow I created looked like this:

Screen shot of Automator workflow

Extracting the text as "Rich Text" was the key. Running this workflow put an RTF file on my desktop that I then opened in Microsoft Word, which (I must now grudgingly admit) has some very useful features for a job like this. When I opened the file, for example, I noticed that all of the footnote text was a certain font size. I then used Word’s ability to find and replace formatted text to find and eliminate all text of that font size.

I used a similar technique to get rid of all the footnote reference numbers in the text, but in this case I had to be more specific, because some of the text I wanted to preserve (like the superscript "th," "st," and "nd" in ordinal numbers like "4th," "1st," and "2nd") was the same font size as the footnote markers. So I used Word’s native version of regular expressions (called wildcards) to find only numbers of that font size. In other words, the "Advanced Find and Replace" dialogue I used looked like this:

Word find and replace dialogue with wildcards
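(For those trying to reproduce this: in Word’s wildcard syntax, [0-9] matches a single digit and {1,} means "one or more," so a pattern along the lines of [0-9]{1,}, combined with a font-size restriction set through the Format button in the dialogue, will match runs of digits at that size. I’m reconstructing the idea here, not quoting the dialogue exactly.)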

I used the same technique to eliminate the reference numbers left over from the eliminated footnotes, which were all of an even smaller font size. Similar adjustments can be made by noticing that many of the ordinal suffixes mentioned earlier ("th," "st," and "nd") are "raised" or "lowered" by a certain number of points. You can see this by selecting those abbreviations and then opening the Font window in Word; clicking on the "Advanced" tab will reveal whether the text has been lowered or raised. An advanced find and replace that changed all text raised or lowered by a specific number of points into text that is neither raised nor lowered fixed some, though not all, of these problems.

At this point I had reached the limit of what I could do with the formatting find and replace features in Word, so I saved my document as a Plain Text file (with the UTF-8 encoding option checked, to make things easier later for our Python parsing script) and then opened it up in a text editor. I noticed that there were still some problems (though not as many!) in the text:

Houston, we have a problem

The main problem seems to arise in cases where there was a superscript ordinal suffix in the first line of an ad. As you can see, the "th" ends up getting booted up to the first line, and the remainder of the line gets booted down to the bottom of the page. Fortunately, there seems to be some pattern to this madness, a pattern susceptible to regular expressions. I also noticed that the orphaned line fragments following ordinals always seem to be moved to the bottom of the "page," right before the page number (in this case "16"). This made it possible to do a regex search for any line ending in "th" (or "st" or "nd") whose next line ends in a number, followed by a replacement that moves the suffix back where it belongs. Though it took a while to manually confirm each of these replacements (I was worried about inadvertently destroying text), it wasn’t too hard to do.
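To illustrate just the search half of that (the filename here is hypothetical), a command like this prints every line ending in an ordinal suffix along with the line that follows it, so you can eyeball whether that next line ends in a number:

grep -n -A1 -E '(st|nd|th)$' arkansas.txt

The replacements themselves I did with a text editor’s regex find and replace, confirming each match by hand as described above.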

A second regex search for page numbers allowed me to find all of the orphan fragments and manually move them to the lines where they should be (checking the master file from DRS in cases where it wasn’t clear which ad each fragment went with). The final step (which we already learned how to do in class) was to use a regular expression to remove all the year headers and page numbers from the file, as well as any blank lines. Franco’s drsparser script did the rest of the work of bursting the text file into individual ads and named the files using the provided metadata.
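Incidentally, the header-and-page-number cleanup can also be done from the command line. Assuming the year headers and page numbers each sat alone on their own lines, as they did in my file, something like this (filenames hypothetical) strips them out along with the blank lines:

grep -vE '^[0-9]*$' arkansas.txt > arkansas-clean.txt

The pattern ^[0-9]*$ matches any line made up entirely of digits, including lines with nothing on them at all.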