Discovering Runaway Slave Ads

These last few days, Franco and I have been developing a way to detect runaway slave ads in images of 19th-century newspapers. The Portal to Texas History has digitized thousands of issues of Texas newspapers and is a source waiting to be explored for runaway slave ads. For example, a search for “runaway negro” in the full text (OCR transcriptions) of their collection yields 7,159(!) results. Clearly, that number is too high for manual perusal of all possible matches.

Thus, we have been thinking about ways to automate the process. At the suggestion of Dr. McDaniel, we decided to use OpenCV, a popular open source computer vision library, to conduct object recognition for the classic runaway slave icon. You know, this one:

(In newspapers, from what I have seen, it usually appeared much smaller and simplified, as shown here).

OpenCV has a tool called Cascade Classifier Training that builds an XML file which can then be used to detect objects. It requires a set of positive samples (images that contain the chosen object) and negative samples (images that do not contain the object but are of similar context). It works best with a large dataset of positive samples, and to generate one it provides a utility called opencv_createsamples that takes an image and applies transformations to it, such as adjustments in intensity, rotations, color inversions, and more, to produce altered versions. Once the cascade has been trained, it can be used to efficiently detect and locate the desired object in other images.
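For example, once a cascade has been trained, putting it to work from Python looks roughly like this (a minimal sketch, assuming the trained file is named cascade.xml; the file paths and detection parameters are guesses to tune, not settings we have validated):

    import cv2

    # Load the trained cascade; opencv_traincascade writes cascade.xml
    # into the folder passed via -data (this path is hypothetical)
    cascade = cv2.CascadeClassifier('classifier/cascade.xml')

    # The detector works on grayscale images
    page = cv2.imread('newspaper_page.png')
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)

    # Scan the page at multiple scales for regions resembling the icon
    matches = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)

    # Box each candidate icon for manual review
    for (x, y, w, h) in matches:
        cv2.rectangle(page, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cv2.imwrite('detected.png', page)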

So, the first order of business in preparing to do object recognition was to collect a set of runaway slave icons. I downloaded ~35 newspaper page images containing the icon and cropped them so that only the icon was visible. The tutorials [1, 2, 3 ..others] I read suggested that for best results the positive images (images of the object to be detected) should all have the same aspect ratio. For simplicity, I made sure all my images were 60×64px.

Next I generated a set of negative (background) images from newspaper pages that did not have the runaway icon. These had to be the same size as the positive images. I read that a large data set was especially needed for the negatives, so I wrote a simple script to crop newspaper page images into a series of individual 60×64 pics. For anyone curious, here’s a gist of the code. A typical image looked something like this.
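At its core, the dicing script is just a pair of nested loops over the page (a simplified sketch of the gist; the file names here are hypothetical):

    import os

    import cv2

    TILE_W, TILE_H = 60, 64  # match the positive image dimensions

    page = cv2.imread('newspaper_page.png')
    height, width = page.shape[:2]

    # Walk the page in tile-sized steps, saving each crop as a negative
    os.makedirs('negatives', exist_ok=True)
    pic_num = 0
    for y in range(0, height - TILE_H + 1, TILE_H):
        for x in range(0, width - TILE_W + 1, TILE_W):
            tile = page[y:y + TILE_H, x:x + TILE_W]
            cv2.imwrite('negatives/neg-%d.png' % pic_num, tile)
            pic_num += 1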

After running the script on several images, I ended up with ~1600 negative images to use in training the cascade classifier. I supplemented those with some manually cropped pics of common icons, such as the one that appears to the left.

Next I used the find command in Terminal to output text files containing lists of all the positive and all the negative images. Then I created the “sample,” a binary file containing all the positive images, which the cascade trainer (opencv_traincascade) requires. As I mentioned, transformation settings are usually specified when creating the sample in order to multiply the amount of data available to train the cascade. I figured that the runaway icon would always appear upright, and I made sure my positive image set contained icons of varying clarity, so I just ran opencv_createsamples without any distortions.
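The commands looked approximately like this (reconstructed after the fact, so treat the file names as illustrative):

    # list every positive and negative image
    find positives -name '*.png' > positives.txt
    find negatives -name '*.png' > negatives.txt

    # opencv_createsamples wants an "info" file of the form
    # "path count x y w h"; each positive is already a tight
    # 60x64 crop, so the box just covers the whole image
    sed 's/$/ 1 0 0 60 64/' positives.txt > positives.info

    # pack the positives into the binary .vec sample file, no distortions
    opencv_createsamples -info positives.info -num 35 -w 60 -h 64 -vec samples/samples.vec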

Finally, I had all I needed to train the cascade. I ran the following command in Terminal:
opencv_traincascade -data classifier -vec samples/samples.vec -bg negatives.txt -numStages 6 -minHitRate 0.95 -numPos 27 -numNeg 1613 -w 60 -h 64 -precalcValBufSize 512 -precalcIdxBufSize 256

opencv_traincascade is the program to be run. The value for data is the name of the folder in which to store the resulting cascade file. The value for vec is the path to the samples vector file. The value for bg is the name of the file listing the paths to the negative images. I am not entirely sure about numStages, so I just picked 6 since I didn’t want the training to run for days as others have experienced. minHitRate dictates the minimum fraction of positives each stage must correctly detect. numPos I still don’t quite understand (apparently it is the number of positives fed to each stage, which needs headroom because some samples get rejected between stages), so I chose ~80% of the number of positive images to ensure no errors would result. numNeg is the number of negative images. Then there are width, height, and some settings specifying how much RAM the program can hog.

I had high hopes, but after 30 minutes of fans-blaring CPU use the program quit with the message, “Required leaf false alarm rate achieved. Branch training terminated.” I need to do more research to figure out why it didn’t work, but an initial search suggested that the number of positive samples I used may not be enough. Joy..

Next Steps:

  • Play around with OpenCV some more to try to get a functional cascade. Maybe enlist the help of Stack Overflow or Reddit.
  • Rethink whether object recognition is the best way to maximize runaway slave ad discovery. While a lot of ads did use the icon, perhaps a larger number did not. For newspapers with digital transcriptions, text-based analysis would surely provide better results.
  • If we can’t get a working cascade to do object recognition, revisit newspaper decomposition. Franco and I tried using Hough Line Transforms through OpenCV to detect the lines separating newspaper articles, but to no avail. Its promise is marked-up images like the Sudoku board shown below; to the right of it is our “success.” The theory is that if we could detect the dividing lines in newspapers, we could crop the pages into individual articles, run OCR on each article, and then do text analysis to discover runaway ads. It is no easy feat, though, as these [1, 2] research articles demonstrate.
  • I was able to improve our results by limiting detected lines to those with approximately horizontal or vertical slopes, since those are the only ones we are interested in for newspapers, but it is clear we need to tweak the script or enlist a better system. (A sketch of that slope filtering appears after the images below.)

    Marked-up Sudoku board using Hough Line Transform

    Hough Line Transform output on a newspaper page: the best we can do so far..
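    For reference, the slope filtering looks roughly like this in OpenCV (a minimal sketch, not our exact script; the Canny thresholds, vote threshold, and angle tolerance are guesses to be tuned):

        import cv2
        import numpy as np

        # Find long straight lines that might be column/article dividers
        page = cv2.imread('newspaper_page.png')
        gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 50, 150)

        # rho = 1px, theta = 1 degree resolution; 200 votes to count as a line
        lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)

        if lines is not None:
            tol = np.deg2rad(2)  # ~2 degrees of tolerance
            for line in lines:
                rho, theta = line[0]
                # In HoughLines, theta ~ 0 (or ~pi) means a vertical line and
                # theta ~ pi/2 a horizontal one; discard everything else
                if min(theta, np.pi - theta) < tol or abs(theta - np.pi / 2) < tol:
                    a, b = np.cos(theta), np.sin(theta)
                    x0, y0 = a * rho, b * rho
                    pt1 = (int(x0 - 3000 * b), int(y0 + 3000 * a))
                    pt2 = (int(x0 + 3000 * b), int(y0 - 3000 * a))
                    cv2.line(page, pt1, pt2, (0, 0, 255), 2)
        cv2.imwrite('lines.png', page)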

    If you have any tips or feedback, feel free to contact Franco (@FrancoBettati31) or me (@brawnstein) on Twitter, or leave a comment below. Thanks!

3 Responses to Discovering Runaway Slave Ads

  1. This is a great report! Impressed by what you’ve been able to do so far.

    Some of the results I got when Googling the “Required leaf false alarm” error suggest that not all hope is lost for getting that to work. I saw posts, as you did, suggesting that the ratio of positive to negative samples might need adjusting, or perhaps a lower number of stages? If it looks like you need more computing power, let me know and maybe we can request some space on more powerful machines. You also make good points, though, about what we would lose by this method even if it worked perfectly: lots of ads don’t have the icon.

    Your dice_image gist is really cool! It made me wonder whether using it to decompose the page into candidate articles might be a possibility. It obviously wouldn’t be as exact as if you could get the Hough Line Transforms to work, but it could be interesting. Even if we chunked the OCR text as we were talking about (i.e., search for a word, and chunk a certain number of words around it), we wouldn’t always be getting perfect articles. So what if you compared that method with image chunking using your script? I’m imagining something like this:

    1. Get the input image width, divide it by roughly the number of columns and make that value the resulting cropped image width.
    2. Set the cropped image height to something sensible, a little bit larger than the height of a typical ad.
    3. Dice the image, storing the start_x and start_y coordinates and correlating them with a unique pic_num in an index file.
    4. OCR the resulting chunks, and then maybe feed those into MALLET the way Nelson did for the Dispatch to see if a candidate “runaway ad” model emerges that could then lead you back to the exemplary chunks to find candidate ads.
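    Steps 1 to 3 might look something like this in Python (a rough sketch; the column count, chunk height, and file names are invented for illustration):

        import csv
        import os

        import cv2

        NUM_COLUMNS = 4   # step 1: guessed per-paper column count
        CHUNK_H = 300     # step 2: a little taller than a typical ad, in pixels

        page = cv2.imread('newspaper_page.png')
        height, width = page.shape[:2]
        chunk_w = width // NUM_COLUMNS

        # Step 3: dice the page, recording each chunk's origin in an index file
        os.makedirs('chunks', exist_ok=True)
        with open('index.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['pic_num', 'start_x', 'start_y'])
            pic_num = 0
            for start_x in range(0, width - chunk_w + 1, chunk_w):
                for start_y in range(0, height, CHUNK_H):
                    chunk = page[start_y:start_y + CHUNK_H, start_x:start_x + chunk_w]
                    cv2.imwrite('chunks/chunk-%d.png' % pic_num, chunk)
                    writer.writerow([pic_num, start_x, start_y])
                    pic_num += 1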

    MALLET could have trouble with the fact that the chunks will often bridge articles. You could experiment with different height values, decreasing them as needed. It’s not perfect, but nothing is … At least if the cascade classifier method didn’t work, you’d then have another computer-vision method to compare with any text-chunking method you devised.

  2. Perhaps I can run createsamples with just image intensity adjustments. That way we could increase the positive sample size by a couple hundred. That said, I really just need to go back and study the documentation to make sure I have all the parameters right. Once we’re ready to give it another try, more computing power would be great. First though, (tomorrow) let’s weigh the pros and cons of icon detection so we don’t waste time or resources.
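    Something like this, maybe (the flags are from the opencv_createsamples documentation, but the values are guesses):

        # generate a few hundred positives from one icon image, varying only
        # intensity: the rotation angles are zeroed out, and the maximum
        # intensity deviation is set to 60
        opencv_createsamples -img icon.png -bg negatives.txt -vec samples/samples.vec \
            -num 500 -maxxangle 0 -maxyangle 0 -maxzangle 0 -maxidev 60 -w 60 -h 64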

    I agree about rough page chunking. It’s worth looking into how conserved the newspaper layout was over time in terms of number and width of the columns. Even if we had to manually set up column widths for each newspaper (e.g. Telegraph and Texas Register), if a certain page number’s layout was fairly consistent over time, that should be plenty good. Or (just throwing an idea out there) if there were a few templates for each page, we might use the rough results of Hough Line Transform to identify which template a page uses. However, I can’t imagine dice_images would produce reasonable horizontal divisions because of how varied article lengths are. If it did, I think your algorithm would work well.

    But again, I really just think we need to step back and identify our goals. (They’re not so clear in my head, but maybe that’s just me..) Is it pure runaway ad detection? Because if so, just looking for keywords would be simpler. We could generate a word frequency table from known runaway ads, then use the number of matches, weighted by how often each match word occurs in runaway ads versus other sections of the newspaper, to generate a confidence score that a given newspaper page contains a runaway ad. Then we would set a threshold on that score to decide which pages warrant further (human) inspection. Franco and I are taking the course Web Application Development; if we went with this system and got a lot of hits, perhaps we could even make a website to crowdsource the process of determining which hits are legit, similar to how the New York Public Library enlists the Internet to transcribe historical restaurant menus. It would probably be more than a 2-month project, though.
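    In sketch form, that scoring could start as simply as this (the keywords, weights, and threshold are placeholders; real values would be derived from known runaway ads):

        # Placeholder weights; in practice these would come from comparing how
        # often each word appears in known runaway ads vs. the rest of the paper
        KEYWORD_WEIGHTS = {
            'runaway': 5.0,
            'absconded': 4.0,
            'reward': 3.0,
            'negro': 2.0,
        }
        THRESHOLD = 8.0  # tune against pages known to contain ads

        def ad_confidence(ocr_text):
            # Sum the weights of every matched keyword on the page
            words = ocr_text.lower().split()
            return sum(KEYWORD_WEIGHTS.get(w, 0.0) for w in words)

        def needs_human_review(ocr_text):
            return ad_confidence(ocr_text) >= THRESHOLD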

    The Portal to Texas History has a lot of OCR data to work with, but if that collection has already largely been analyzed (or are we trailblazers?), we might set our sights instead on creating a good OCR framework for machine-transcribing newspaper page images (again, I don’t know whether this is an untapped realm or whether such a collection even exists to work with). I noticed that the OCR txt files for UNT’s newspaper collection have a problem that doesn’t make much difference for keyword search but is problematic for chunking articles: the OCR text from one column is sometimes interleaved with text from another column. Perhaps by first breaking the page up into columns, we could run OCR on each cropped image and then concatenate the results for a more accurate transcription. We might use a system similar to Ted Underwood’s for post-processing the text. I also found a method that uses Google’s “Did you mean?” results to spell-correct, though we might accidentally over-correct old spellings of words.
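    Roughly what I have in mind (a sketch using pytesseract as a stand-in OCR engine; the fixed column count is hypothetical):

        import cv2
        import pytesseract

        NUM_COLUMNS = 4  # would be set per newspaper/page layout

        # OCR each column separately so text from adjacent columns can
        # never be interleaved, then concatenate the results in order
        page = cv2.imread('newspaper_page.png')
        gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
        height, width = gray.shape
        col_w = width // NUM_COLUMNS

        columns = [gray[:, i * col_w:(i + 1) * col_w] for i in range(NUM_COLUMNS)]
        transcription = '\n'.join(pytesseract.image_to_string(col) for col in columns)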

    I am hesitant to use MALLET, though maybe that’s just because I don’t entirely understand it. MALLET might do the trick in generating the confidence levels, but we’re not interested in all the potential topics, just that one. Not to mention, how would we even ensure it produces a specific, inclusive topic for “runaway slave ads”? I think my idea does the same thing, but with a fixed “topic” for runaway slave ads defined by pre-chosen keywords.

    If our goal is to produce decomposed article images/text for further use by us and others, that might be too lofty a vision. If nothing else, our work this semester will be a learning experience!

  3. With a research team at the University of Nebraska-Lincoln, I’m working on the detection of poetic content in newspapers from the Chronicling America corpus based on image analysis rather than textual analysis. We haven’t worried about image segmentation yet, as we’re still working on training the classifier to deal with poetic content and non-poetic content, but in the eventual implementation, we will use image segmentation. I’ll be eager to see how your research develops and once we get a little further in our work would be happy to talk about our strategies.

    I’m really excited about your project, and I hope you’ll continue to work on developing your content-based image retrieval, rather than (or in addition to) focusing on the OCR text. The digital images we are creating for/in these large-scale collections have far more information value than we are currently leveraging, when we use them only as visual facsimiles for human readers. If we focus exclusively on the electronic text, we are ignoring another major source of information that potentially allows us to ask and answer different questions. In addition, advancing image segmentation for historic newspapers and utilizing image processing for content-based analysis has the potential to make resources such as Chronicling America and the Texas Newspapers Project even more usable, beyond a single research interest.

    On another note, a student here is using MALLET to pursue perhaps a similar type of research question, though again with poetic content rather than runaway slave ads. In Rob Nelson’s analysis, the poetry of the Richmond Daily Dispatch grouped with patriotism, if I remember correctly. The student working here (I’ll encourage her to comment on her own) is analyzing the poetic content of a newspaper from the duration of the Civil War, as well as the content of the newspaper as a whole, in part to see whether there is a marker of poetic content and to pursue some other research questions. Ultimately, we’re interested in how a combination of strategies can help us learn more about 19th-c. newspaper poetry.