Category Archives: Runaway Ads

Script for Counting Ads

Some of you expressed an interest in being able to quickly count all the ads in a folder and determine how many were published in a given year, decade, or month (to detect seasonal patterns across the year).

Here is a script that can do that. It is designed to work on Mac or Linux systems.

To use it, you should first download our adparsers repo by clicking on the "Download Zip" button on this page:

Download the adparsers repo as a zip


Unzip the downloaded file, and you should then have a directory that contains (among other things) the script.

You should now copy the script to the directory that contains the ads you want to count. You can do this the drag-and-drop way, or you can use your terminal and the cp command. (If you forgot what that command does, revisit the Command Line bootcamp that was part of the MALLET homework.) Once the script is in the directory, navigate to that directory in your terminal, and then run the command like this:
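For example, the whole workflow looks like the following. The script name count_ads.sh here is a hypothetical stand-in (use the actual filename from the adparsers repo), and the first line just creates a stub so the example runs end to end:

```shell
# count_ads.sh is a hypothetical stand-in for the real script from adparsers;
# this stub only exists so the commands below can be demonstrated end to end.
printf '#!/bin/sh\necho "TOTAL 1632"\n' > count_ads.sh

chmod +x count_ads.sh        # fixes "Permission denied" errors
./count_ads.sh               # prints the chronological breakdown
./count_ads.sh > counts.txt  # or redirect the output to a file
cat counts.txt
```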


If you get an error message, you may need to follow the instructions in the comments at the start of the script (which you can read on GitHub) to change the permissions. But if all goes well, you’ll see a printed breakdown of chronological counts. For example, when I run the script in the directory containing all our Mississippi ads, the script returns this:

TOTAL   1632 
1830s   1118
1840s   178
1850s   133
1860s   4
1830    30
1831    54
1832    87
1833    68
1834    143
1835    157
1836    262
1837    226
1838    63
1839    28
1840    16
1841    16
1842    22
1843    33
1844    44
1845    25
1846    14
1847    1
1848    5
1849    2
1850    11
1851    17
1852    19
1853    15
1854    7
1855    9
1856    11
1857    23
1858    13
1859    8
1860    4
1   100
2   89
3   103
4   130
5   160
6   161
7   188
8   150
9   150
10  149
11  146
12  86

If you choose, you can also "redirect" this output to a file, like this:

./ > filename.txt

Now you should be able to open filename.txt (which you can name whatever you want) in Microsoft Excel, and you’ll have a spreadsheet with all the numbers.

The script may seem to have limited value on its own, but the key to its utility lies in first getting an interesting set of ads into a directory. For example, if you only wanted to know the month distribution of ads in a particular year, you could first move all the ads from that year into a directory and run the script from within it. You’d get lots of zeroes for all the years that you’re not interested in, but you would also get the month breakdown that you are interested in. Depending on which ads you put in the directory being counted, you can get a lot of useful data that can then be graphed or fed into further calculations.
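Sketched in the terminal, that gathering step might look like this. The filenames and directory names here are hypothetical; the pattern assumes the year appears somewhere in each ad's filename:

```shell
# Hypothetical ad filenames with the year embedded; adjust the glob to match
# however our real files are actually named.
mkdir -p all-ads ads-1836
touch all-ads/telegraph_1836-05-02.txt all-ads/register_1836-11-20.txt all-ads/telegraph_1837-01-10.txt

cp all-ads/*1836* ads-1836/   # gather just the 1836 ads
ls ads-1836                   # then run the counting script from inside ads-1836
```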

Reassembling URLs from Files

As your groups have begun drafting essays for our final product, some of you have asked me how to figure out how to recompose the permalink to a Texas ad using the information in the ad’s txt filename. Here’s a quick tutorial. Continue reading

Updates on Twitter Bot

You may have noticed from my posts on Twitter that today is Day of DH 2014. To make a long story short, on #DayofDH, digital humanities scholars and teachers create special blogs to document their work for that day and to connect with like-minded scholars. Check it out if you want to learn more about the DH field writ large!

DayofDH logo

It's like a holiday, for digital humanists.

For my blog, I wrote a little bit about our Twitter bot, and particularly shared how I have now set up my computer to tweet an ad automatically every morning. As I mentioned in class yesterday, we now have around 70 followers of the Twitter account, with a couple more adding each day. Exciting times!

Now that our basic idea for the Twitter bot is up and running, perhaps we can also talk about whether there is anything else we want to add to it.

One potential limitation of our current setup is that only those who have followed us are likely to see our tweets (except when one of our followers retweets an ad, which hasn’t really happened yet). But one of our stated goals in the essay was to "surprise" people by showing them an ad in a context where they don’t expect it. We will still accomplish that with our followers, but their "surprise" will be lessened by the fact that they have decided to follow our account. Any ideas about how we can increase the distribution of, and audience for, the tweets, particularly among non-followers?

Another idea that Alyssa brought up in class was to add to our account some regular "on this day" tweets. If you have ideas about how such tweets should be worded, please share them in the comments. There may be some way to word these OTD tweets in a way that solves the problem above. Open to your suggestions!

Posting Ads to Twitter

Daniel’s question in class on Monday, about whether we were planning to release the ads we have found to the public, reminded me that we had earlier discussed the possibility of tweeting out our transcriptions with a link to the zoomed image in Portal of Texas History.

This tutorial suggests that may not be too difficult, especially now that we have a way to get all of our transcriptions out of our spreadsheets and into text files. It would be possible to write a script that reconstructs the URL to the page image from the title of our text files, and then tweets the first several words of the transcription with a link. That could be a way both of sharing our finds and of increasing interest in our larger project.
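As a rough sketch, such a script could look like the snippet below. Both the filename convention and the permalink template are assumptions made up for illustration (our real files and the Portal's actual URL scheme would need to be checked), so treat every name here as hypothetical:

```python
# Hypothetical sketch only: the filename convention ("<identifier>_page<N>.txt")
# and the permalink template below are assumptions, not the Portal's confirmed
# scheme. Adjust both to match our real files before using this.
def permalink_from_filename(fname: str) -> str:
    stem = fname.rsplit(".", 1)[0]            # drop the .txt extension
    identifier, page = stem.split("_page")    # split off the page number
    return f"https://texashistory.unt.edu/{identifier}/m1/{page}/"

def tweet_text(transcription: str, url: str, limit: int = 140) -> str:
    # Leave room for the link, then trim the transcription to fit (140 chars
    # was the Twitter limit at the time).
    room = limit - len(url) - 2
    return transcription[:room].rstrip() + "… " + url

url = permalink_from_filename("metapth12345_page3.txt")
print(url)  # https://texashistory.unt.edu/metapth12345/m1/3/
print(tweet_text("Ranaway from the subscriber, living near Houston...", url))
```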

Is this something we would still be interested in doing? Thoughts?

Progress Report #1 Tasks

As indicated on the syllabus, your first Progress Report on our class project is due this Monday, March 17, by the end of class. The progress report should take the form of a correctly formatted, hyperlink-rich post to this blog. Each group needs to make only one post, but you should work together on the post and will be assigned a grade on the report as a group. Note that the report needs to show your progress, even if you haven’t yet completed all the tasks assigned to you. The groups/tasks we assigned last Monday are as follows, but keep in mind that groups and tasks will shift as we move forward.

Continue reading

Discovering Runaway Slave Ads

These last few days, Franco and I have been developing a way to detect runaway slave ads in images of nineteenth-century newspapers. The Portal to Texas History has digitized copies of thousands of issues of Texas newspapers and is a source waiting to be explored for runaway slave ads. For example, a search for “runaway negro” in the full text (OCR transcriptions) of their collection yields 7,159(!) results. Clearly, that number is too high to allow manual perusal of all possible matches.

Thus, we have been thinking about ways to automate the process. At the suggestion of Dr. McDaniel, we decided to use OpenCV, a popular open-source computer vision library, to do object recognition for the classic runaway slave icon. You know, this one:

Fugitive slave icon

(In newspapers, from what I have seen, it usually appeared much smaller and simplified, as shown here).

OpenCV has a tool called Cascade Classifier Training that builds an XML file that can be used to detect objects. It requires a set of positive samples, images that contain the chosen object, and negative samples, images that do not contain the object but are of similar context. It works best with a large dataset of positive samples, and to generate that it provides a function called “createsamples” that takes an image and applies transformations to it, such as adjustments in intensity, rotations, color inversions, and more to make altered versions. Once the cascade has been trained, it can be used to efficiently detect and locate the desired object in other images.

So, the first order of business in preparing to do object recognition was to collect a set of runaway slave icons. I downloaded ~35 newspaper page images containing the icon and cropped each one down to just the icon. The tutorials [1, 2, 3 ..others] I read suggested that for best results the positive images (images of the object to be detected) should all have the same aspect ratio. For simplicity, I made sure all my images were 60x64px.

Next I generated a set of negative (background) images taken from newspaper pages that did not have the runaway icon. These had to be the same size as the positive images. I read that a large data set was especially needed for the negatives, so I wrote a simple script to crop newspaper page images into a series of individual 60×64 pics. For anyone curious, here’s a gist of the code. A typical image looked something like this:

Sample background image
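The heart of that cropping script is just a tiling loop. The sketch below computes the 60×64 crop boxes in pure Python; the real script would do the actual image reading and writing with OpenCV, which is omitted here:

```python
# Sketch of the tiling logic behind the negative-image cropping script.
# The real version reads the page with OpenCV and saves each crop to disk;
# here we only compute the (left, top, right, bottom) box for each 60x64 tile.
def tile_boxes(page_w: int, page_h: int, tile_w: int = 60, tile_h: int = 64):
    boxes = []
    for top in range(0, page_h - tile_h + 1, tile_h):
        for left in range(0, page_w - tile_w + 1, tile_w):
            boxes.append((left, top, left + tile_w, top + tile_h))
    return boxes

# A 300x256 page divides into 5 columns x 4 rows = 20 negative samples.
print(len(tile_boxes(300, 256)))  # 20
```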

Negative sample for training the Haar cascade

After running the script on several images, I ended up with ~1600 negative images to use in training the cascade classifier. I supplemented these with some manually cropped pics of common icons, such as the one that appears to the left.

Next I used the find command in the terminal to output text files containing a list of all the positive and all the negative images. Then I created the “sample,” a binary file containing all the positive images that is required by the cascade trainer (opencv_traincascade). As I mentioned, transformation settings are usually specified when creating the sample in order to multiply the amount of data available to train the cascade. But I figured that the runaway icon would always appear upright, and I made sure my positive image set contained icons of varying clarity, so I just ran opencv_createsamples without any distortions.
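Building those file lists looks roughly like this. The directory names pos/ and neg/ are stand-ins for wherever the cropped images actually live, and the opencv_createsamples invocation is shown only as a comment since its exact flags depend on how the positives are annotated:

```shell
# pos/ and neg/ are hypothetical directory names; the stub images below just
# let the find commands run end to end.
mkdir -p pos neg
touch pos/icon1.png pos/icon2.png neg/bg1.png

find ./pos -name '*.png' > positives.txt   # list of positive images
find ./neg -name '*.png' > negatives.txt   # list of negative images
wc -l positives.txt negatives.txt

# Then pack the positives into the binary .vec file the trainer expects,
# something like (not run here):
# opencv_createsamples -info positives.txt -vec samples.vec -w 60 -h 64
```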

Finally, I had all I needed to train the cascade. I ran the following command in Terminal:
opencv_traincascade -data classifier -vec samples/samples.vec -bg negatives.txt -numStages 6 -minHitRate 0.95 -numPos 27 -numNeg 1613 -w 60 -h 64 -precalcValBufSize 512 -precalcIdxBufSize 256

opencv_traincascade is the program to be run. The value for -data is the name of the folder in which to store the resulting cascade file. The value for -vec is the path to the samples vector file, and -bg names the file containing paths to each negative image. I am not entirely sure about numStages, so I just picked 6, since I didn’t want the training to run for days as others have experienced. minHitRate dictates the accuracy. I still don’t quite understand numPos, but I chose ~80% of the number of positive images to ensure no errors would result. numNeg is the number of negative images. Then there are the width, the height, and some settings specifying how much RAM the program can hog.

I had high hopes, but after 30 minutes of fan-blaring CPU use the program quit with the error, “Required leaf false alarm rate achieved. Branch training terminated.” I need to do more research to figure out why it didn’t work, but an initial search told me that the number of positive samples I used may not be enough. Joy..

Next Steps:

  • Play around with OpenCV some more to try to get a functional cascade. Maybe enlist the help of stackoverflow or reddit.
  • Rethink whether object recognition is the best way to maximize runaway slave ad discovery. While a lot of ads did use the icon, perhaps a larger number did not. For newspapers with digital transcriptions, text-based analysis would surely provide better results.
  • If we can’t get a working cascade to do object recognition, revisit newspaper decomposition. Franco and I tried using Hough Line Transforms through OpenCV to detect the lines separating newspaper articles, but to no avail. Their promise lies in marked-up images like the Sudoku board shown below; to the right of it is our “success.” The theory is that if we could detect the dividing lines in newspapers, we could crop the pages into individual articles, run OCR on each article, and then do text analysis to discover runaway ads. It is no easy feat, though, as these [1, 2] research articles demonstrate.
  • I was able to improve our results by limiting detected lines to those with approximately horizontal or vertical slopes, since those are the only ones we are interested in for newspapers, but it is clear we need to tweak the script or enlist a better system.

    Marked up Sudoku board using Hough Line Transform

    Sudoku hough line transform

    Hough Line Transform output

    Best we can do so far..

    If you have any tips or feedback, feel free to contact Franco (@FrancoBettati31) or me (@brawnstein) on Twitter, or leave a comment below. Thanks!
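The slope filtering mentioned above amounts to checking each detected segment's angle. Here is a minimal sketch, with the line endpoints as plain tuples (in the real script they would come from OpenCV's probabilistic Hough transform, cv2.HoughLinesP):

```python
import math

# Keep only Hough line segments that are roughly horizontal or vertical,
# since those are the only dividers we care about in a newspaper page.
# Each line is an (x1, y1, x2, y2) endpoint tuple.
def keep_axis_aligned(lines, tol_deg=5.0):
    kept = []
    for (x1, y1, x2, y2) in lines:
        angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1))) % 180
        # Near-horizontal (angle ~ 0 or ~ 180) or near-vertical (angle ~ 90).
        if angle <= tol_deg or angle >= 180 - tol_deg or abs(angle - 90) <= tol_deg:
            kept.append((x1, y1, x2, y2))
    return kept

lines = [(0, 10, 200, 12),   # near-horizontal column rule -> kept
         (50, 0, 52, 300),   # near-vertical divider -> kept
         (0, 0, 100, 100)]   # 45-degree artifact -> dropped
print(len(keep_axis_aligned(lines)))  # 2
```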

Group Presentations

In the next two weeks of class, we will divide our labor so that we can learn about some different kinds of digital tools that might help us answer (or more effectively present our answers) to our questions about slavery and runaway slave ads in Texas.

You will work with a partner to work through some tutorials (much like you did for Homework #3), and then talk with your partner about how this tool (or others like it) might be useful for our class. Your final task will then be to report back to the class on what you have done with an oral presentation that gives your classmates a sense of what the tool can do and what it might do for us.

Continue reading

Our Questions

Data Analysis Questions

  • What was the typical profile of a subscriber?
  • What was the typical profile of an advertised runaway slave?
  • Did the people advertised run away in groups, and if so, what kinds of groups?
  • How long did an ad run, how often was it reprinted, and did rewards increase over time?
  • What techniques did runaways use to escape? How often did they succeed, or try again if they were captured?
  • Why did particular individuals run away, and where were they suspected to have gone? (How did slaveholders answer these questions, as opposed to the answers runaways would have given?)
  • When (in the year, or in all the years) were runaway ads most likely to appear?
  • How did subscribers describe or think about the enslaved people advertised in the ads? As individuals or anonymous laborers? How did slaveholders react to runaways?
  • What systems or factors prevented successful escapes? What were runaways up against?
  • How long after an escape did slaveholders wait to advertise?
  • How often (and how widely) were ads reprinted?

With regard to all of these questions, we also have been wondering:

  1. What difference did the distinctive characteristics of Texas make for the answers to any of these questions? Did Texas depart from or conform to patterns like those shown in the Franklin and Schweninger book?
  2. How did the answers to any of these questions change over time?

Data Visualization Questions

  • Would it be possible to visualize, using the ads, the differences between the slaveholders’ “geography of containment” and the slaves’ “rival geography”?
  • What visualizations of our data would most effectively communicate what slavery meant or was like for the enslaved? Which visualizations might persuade even better (or in a different way) than a written historical narrative?
  • Which visualizations might allow us to better see answers to our analytical questions?
  • What would be the purpose of different visualizations of the data, ranging from numbers on a table, to a map, to images, to an interactive game?

Data Exploration Questions

  • Would it be possible to make the discovery of runaway ads more automated, using page images or OCR text of the newspaper pages?

Comments on Metadata

Our collaborators over at the University of North Texas have been posting about questions related to metadata in their runaway slave ad database. Use this post to comment on their posts, linking directly to the post(s) you want to respond to.

Video Game History

Yesterday we talked briefly in class about two web-based role-playing games in which the player assumes the role of a runaway slave. Both are called "Flight to Freedom," the first at Bowdoin College and the second at the NEH-sponsored site Mission US.

Screen shot from Mission US game

Screen shot from Mission US game, Flight to Freedom

One of the issues that came up in our discussion was whether "gamifying" the history of slavery is ever appropriate, and as it turns out, this issue has just been in the news because of a new slavery-centered game in the Assassin’s Creed series.

Another question that our discussion raised was whether video games can be considered "history" at all. This, too, is a long-standing question, and there is a very thoughtful group blog called Play the Past devoted to considering it. It’s a blog often written for historians by historians, and just yesterday historian Trevor Owens argued that games can be historical scholarship. His argument hinges on many of the points that came up yesterday in class about the widening of access to history, and the benefits of breaking away from a linear narrative form of argument that is itself no less interpretive than a "non-fiction game" would be.

Of course, even if we could agree that games could be history, that wouldn’t necessarily mean that the history of slavery should be represented with a game. At the very least, perhaps, the article on "Assassin’s Creed" suggests that with games, as with essays, there are better and worse ways to proceed. Play the Past has also published some past posts dealing with attempts to represent slavery in game-play situations, including this one by Rebecca Mir, and this one by Mark Sample about "playing the powerless in videogames about the powerless," which pointed me to the Mission US game in the first place.

In that post, Sample asks:

What are the limits of playing the powerless? What is lost and what is gained in portraying—and playing—a situation that has been well represented in other media? And what considerations should developers and players alike have with regards to responsibility and accountability?

If you’re interested in continuing this discussion, I’d be interested to hear your answers to those questions. Did you read the review of Assassin’s Creed, or play the Flight to Freedom games? What do you think about the limitations and possibilities of games as historical scholarship or as public history?