Category Archives: Programming

Script for Counting Ads

Some of you expressed an interest in being able to quickly count all the ads in a folder and determine how many were published in a given year, decade, or month (to detect seasonal patterns across the year).

Here is a script that can do that. It is designed to work on Mac or Linux systems.

To use it, you should first download our adparsers repo by clicking on the "Download Zip" button on this page:

Download the adparsers repo as a zip

Unzip the downloaded file, and you should then have a directory that contains (among other things) the countads.sh script.

Now copy the script to the directory that contains the ads you want to count. You can drag and drop it, or you can use the cp command in your terminal. (If you have forgotten what that command does, revisit the Command Line bootcamp that was part of the MALLET homework.) Once the script is in the directory, navigate to that directory in your terminal, and then run the script like this:

 ./countads.sh

If you get an error message, you may need to follow the instructions in the comments at the start of the script (which you can read on GitHub) to change the permissions. But if all goes well, you’ll see a printed breakdown of chronological counts. For example, when I run the script in the directory containing all our Mississippi ads, the script returns this:

TOTAL   1632 
     
DEC     ADS
1830s   1118
1840s   178
1850s   133
1860s   4
     
YEAR    ADS
1830    30
1831    54
1832    87
1833    68
1834    143
1835    157
1836    262
1837    226
1838    63
1839    28
1840    16
1841    16
1842    22
1843    33
1844    44
1845    25
1846    14
1847    1
1848    5
1849    2
1850    11
1851    17
1852    19
1853    15
1854    7
1855    9
1856    11
1857    23
1858    13
1859    8
1860    4
     
MONTH   ADS
1   100
2   89
3   103
4   130
5   160
6   161
7   188
8   150
9   150
10  149
11  146
12  86

If you choose, you can also "redirect" this output to a file, like this:

./countads.sh > filename.txt

Now you should be able to open filename.txt (which you can name whatever you want) in Microsoft Excel, and you’ll have a spreadsheet with all the numbers.
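If Excel does not split the columns the way you expect, one option is to convert the output to CSV first. Here is a minimal Python sketch of that idea. It assumes only that each line of the output is a label followed by a count, separated by spaces, as in the sample above. The function name counts_to_csv is made up for this illustration and is not part of the adparsers repo.

```python
import csv
import io


def counts_to_csv(text):
    """Turn whitespace-separated count lines into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in text.splitlines():
        fields = line.split()
        if fields:  # drop the blank lines that separate the tables
            writer.writerow(fields)
    return buffer.getvalue()


# A few lines copied from the sample output above.
sample = "TOTAL   1632 \n     \nDEC     ADS\n1830s   1118\n"
print(counts_to_csv(sample))
```

To use it on a real file, you would read filename.txt, pass its contents to the function, and write the result to a file ending in .csv.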

The script may seem to have limited value on its own, but the key to its utility lies in first getting an interesting set of ads into a directory. For example, if you wanted to know only the month distribution of ads in a particular year, you could first move all the ads from that year into a directory and run the script from within it. You’d get lots of zeroes for all the years that you’re not interested in, but you would also get the month breakdown that you are interested in. Depending on which ads you put in the directory, you can get a lot of useful data that can then be graphed or used in further calculations.
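For anyone who would rather experiment in Python, the same kind of tally can be sketched in a few lines. This is not a translation of countads.sh. It assumes, purely for illustration, that every ad filename contains its date written as YYYY-MM-DD; the pattern and the sample filenames below are hypothetical, so adjust them to match however your ads are actually named.

```python
import re
from collections import Counter

# ASSUMPTION: each ad's filename contains a date written as YYYY-MM-DD.
# The naming scheme that countads.sh relies on may be different.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-\d{2}")


def count_ads(filenames):
    """Tally ads by decade, year, and month based on their filenames."""
    decades, years, months = Counter(), Counter(), Counter()
    total = 0
    for name in filenames:
        match = DATE_PATTERN.search(name)
        if match is None:
            continue  # skip files that carry no recognizable date
        year, month = int(match.group(1)), int(match.group(2))
        total += 1
        decades[f"{year // 10 * 10}s"] += 1
        years[year] += 1
        months[month] += 1
    return total, decades, years, months


# Hypothetical filenames, just to show the shape of the result.
names = ["ad_1836-07-04.txt", "ad_1836-12-01.txt", "ad_1841-07-15.txt"]
total, decades, years, months = count_ads(names)
print("TOTAL", total)
for label, table in (("DEC", decades), ("YEAR", years), ("MONTH", months)):
    print(label, "ADS")
    for key in sorted(table):
        print(key, table[key])
```

Unlike the shell script, this sketch only reports the years and months that actually occur, so it prints no rows of zeroes. To count a real folder, you could pass os.listdir(".") in place of the sample list.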

Discovering Runaway Slave Ads

These last few days, Franco and I have been developing a way to detect runaway slave ads in images of 19th-century newspapers. The Portal to Texas History has digitized copies of thousands of issues of Texas newspapers and is a source waiting to be explored for runaway slave ads. For example, a search for “runaway negro” in the full-text (OCR transcriptions) of their collection yields 7,159(!) results. Clearly, that number is too high to accommodate manual perusal of all possible matches.

Thus, we have been thinking about ways to automate the process. At the suggestion of Dr. McDaniel, we decided to use OpenCV, a popular open source computer vision library, to conduct object recognition for the classic runaway slave icon. You know, this one:

Fugitive Slave Icon

(In newspapers, from what I have seen, it usually appeared much smaller and simplified, as shown here).

OpenCV has a tool called Cascade Classifier Training that builds an XML file that can be used to detect objects. It requires a set of positive samples, images that contain the chosen object, and negative samples, images that do not contain the object but are of similar context. It works best with a large dataset of positive samples, and to generate that it provides a function called “createsamples” that takes an image and applies transformations to it, such as adjustments in intensity, rotations, color inversions, and more to make altered versions. Once the cascade has been trained, it can be used to efficiently detect and locate the desired object in other images.

So, the first order of business in preparing to do object recognition was to collect a set of runaway slave icons. I downloaded ~35 newspaper page images containing the icon and cropped each one so that only the icon was visible. The tutorials [1, 2, 3 ..others] I read suggested that for best results the positive images (images of the object to be detected) should all have the same aspect ratio. For simplicity, I made sure all my images were 60x64px.

Next I generated a set of negative (background) images from newspaper pages that did not contain the runaway icon. These had to be the same size as the positive images. I read that a large data set was especially needed for the negatives, so I wrote a simple script to crop newspaper page images into a series of individual 60×64 pics. For anyone curious, here’s a gist of the code. A typical image looked something like this:

Sample background image
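The gist itself is not reproduced here, but the tiling idea can be sketched as follows. This is only an illustration and may not match the gist: it computes the crop boxes for non-overlapping 60×64 tiles and discards any partial tiles at the right and bottom edges. The function name tile_boxes is my own.

```python
def tile_boxes(page_width, page_height, tile_width=60, tile_height=64):
    """Return (left, upper, right, lower) boxes for non-overlapping tiles."""
    boxes = []
    # Stop early enough that every tile fits entirely inside the page.
    for upper in range(0, page_height - tile_height + 1, tile_height):
        for left in range(0, page_width - tile_width + 1, tile_width):
            boxes.append((left, upper, left + tile_width, upper + tile_height))
    return boxes


# A made-up 130x64 page yields two full tiles; the last 10 pixels are dropped.
print(tile_boxes(130, 64))
```

Each box is in the (left, upper, right, lower) order that the Pillow imaging library's Image.crop method accepts, so with Pillow installed you could crop and save one small image per box.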

After running the script on several images, I ended up with ~1600 negative images to use in training the cascade classifier. I supplemented that with some manually cropped pics of common icons such as this one:

Negative sample for training the HAAR cascade

Next I used the find command in terminal to output text files containing a list of all the positive and all the negative images. Then, I created the “sample,” a binary file containing all the positive images, which is required by the cascade trainer (opencv_traincascade). As I mentioned, transformation settings are usually specified when creating the sample in order to multiply the amount of data available to train the cascade. I figured that the runaway icon would always appear upright, and I made sure my set of positive images contained icons of varying clarity, so I just ran opencv_createsamples without any distortions.
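If you would rather build those lists without the find command, a rough Python equivalent might look like the sketch below. The function names and the default file extensions are assumptions for illustration; the output is simply one image path per line.

```python
import os


def list_images(root, extensions=(".jpg", ".png")):
    """Collect the paths of all image files under root, sorted by path."""
    paths = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name.lower().endswith(extensions):
                paths.append(os.path.join(folder, name))
    return sorted(paths)


def write_list(paths, list_file):
    """Write one path per line to list_file."""
    with open(list_file, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
```

For example, write_list(list_images("negatives"), "negatives.txt") would produce a file like the one passed to -bg in the training command, assuming the negative images live in a folder called negatives.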

Finally, I had all I needed to train the cascade. I ran the following command in Terminal:
opencv_traincascade -data classifier -vec samples/samples.vec -bg negatives.txt -numStages 6 -minHitRate 0.95 -numPos 27 -numNeg 1613 -w 60 -h 64 -precalcValBufSize 512 -precalcIdxBufSize 256

opencv_traincascade is the program to be run. The value for data is the name of the folder in which to store the resulting cascade file. The value for vec is the path to the samples vector file. The value for bg is the name of the file containing the paths to each negative image. I am not entirely sure how to choose numStages, so I just picked 6, since I didn’t want the training to run for days as others have experienced. minHitRate sets the minimum fraction of positive samples that each stage must detect. I still don’t quite understand numPos, but I chose ~80% of the number of positive images to ensure no errors would result. numNeg is the number of negative images. Then there are the width, the height, and some settings specifying how much RAM the program can hog.
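When tweaking these settings between runs, it can help to keep them in one place. The sketch below is just one way to do that in Python: it stores the values from the command above in a dictionary and assembles the same command line from them. The comments restate the description in this post rather than official documentation, and build_command is a made-up helper name.

```python
import shlex

# Values copied from the command above.
settings = {
    "-data": "classifier",           # folder for the resulting cascade file
    "-vec": "samples/samples.vec",   # samples vector file
    "-bg": "negatives.txt",          # file listing the negative images
    "-numStages": 6,
    "-minHitRate": 0.95,
    "-numPos": 27,                   # roughly 80% of the positive images
    "-numNeg": 1613,                 # number of negative images
    "-w": 60,                        # sample width in pixels
    "-h": 64,                        # sample height in pixels
    "-precalcValBufSize": 512,       # memory settings
    "-precalcIdxBufSize": 256,
}


def build_command(program, options):
    """Flatten a dict of flags into an argument list."""
    command = [program]
    for flag, value in options.items():
        command += [flag, str(value)]
    return command


command = build_command("opencv_traincascade", settings)
print(shlex.join(command))
```

The sketch only prints the command. If OpenCV's training tools are installed, the same list could be handed to subprocess.run(command, check=True) to start training.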

I had high hopes, but after 30 minutes of fans-blaring CPU use the program quit with the error, “Required leaf false alarm rate achieved. Branch training terminated.” I need to do more research to figure out why it didn’t work, but an initial search told me that the number of positive samples I used may not be enough. Joy..

Next Steps:

  • Play around with OpenCV some more to try to get a functional cascade. Maybe enlist the help of stackoverflow or reddit.
  • Rethink whether object recognition is the best way to maximize runaway slave ad discovery. While a lot of ads did use the icon, perhaps a larger number did not. For newspapers with digital transcriptions, text-based analysis would surely provide better results.
  • If we can’t get a working cascade to do object recognition, revisit newspaper decomposition. Franco and I tried using Hough Line Transforms through OpenCV to detect the lines separating newspaper articles, but to no avail. The promise of the technique is marked-up images like the Sudoku board shown below. To the right of it is our “success.” The theory is that if we could detect the dividing lines in newspapers, we could crop the pages into individual articles, run OCR on each article, and then do text analysis to discover runaway ads. It is no easy feat, though, as these [1, 2] research articles demonstrate.
  • I was able to improve our results by limiting detected lines to those with approximately horizontal or vertical slopes, since those are the only ones we are interested in for newspapers, but it is clear we need to tweak the script or enlist a better system.

    Marked-up Sudoku board using Hough Line Transform

    Hough Line Transform output: best we can do so far..

    If you have any tips or feedback, feel free to contact Franco (@FrancoBettati31) or me (@brawnstein) on Twitter, or leave a comment below. Thanks!
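For readers curious about the slope filter mentioned in the last bullet above, here is a minimal sketch of the idea. It is not our actual script. It assumes each detected line is available as a segment (x1, y1, x2, y2), which is the form OpenCV's probabilistic Hough transform (cv2.HoughLinesP) returns, and the 5-degree tolerance is an arbitrary value chosen for illustration.

```python
import math


def is_axis_aligned(x1, y1, x2, y2, tolerance_degrees=5.0):
    """True if the segment is within tolerance of horizontal or vertical."""
    # Fold the direction into the range [0, 180) so that a segment and its
    # reverse are treated the same way.
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    near_horizontal = min(angle, 180.0 - angle) <= tolerance_degrees
    near_vertical = abs(angle - 90.0) <= tolerance_degrees
    return near_horizontal or near_vertical


def keep_axis_aligned(segments, tolerance_degrees=5.0):
    """Drop every segment that is neither roughly horizontal nor vertical."""
    return [s for s in segments
            if is_axis_aligned(*s, tolerance_degrees=tolerance_degrees)]


# Made-up segments: nearly horizontal, nearly vertical, and a 45-degree diagonal.
segments = [(0, 0, 100, 2), (0, 0, 3, 100), (0, 0, 100, 100)]
print(keep_axis_aligned(segments))
```

A tighter tolerance removes more stray diagonals, but it may also drop column rules from pages that were scanned slightly crooked.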

JSON Examples and Links

If you’d like to look more closely at the JSON examples discussed in class, here are the exhibits from the handout. To test their validity, you can copy each one to your clipboard and paste it into the JSONLint site and click on "Validate." You may also want to take a look at the JSON specification page that I had up on the screen.
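If you would rather check an example from the command line than paste it into a website, Python's built-in json module can do a similar job. The sketch below is only an illustration: the two sample strings are made up and are not the exhibits from the handout, and check_json is a hypothetical helper name.

```python
import json


def check_json(text):
    """Return (True, parsed_value) or (False, error_message)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as error:
        return False, f"line {error.lineno}, column {error.colno}: {error.msg}"


# Hypothetical examples, not the ones from the class handout.
good = '{"title": "Runaway ad", "year": 1836}'
bad = "{'title': 'Runaway ad'}"  # single quotes are not valid JSON

print(check_json(good))
print(check_json(bad))
```

One caveat: Python's parser is slightly more lenient than the JSON specification (for instance, it accepts NaN by default), so a validator like JSONLint remains the stricter test.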

If you still feel a bit lost with these examples, don’t worry; we will spend more time clearing up confusion on Friday and throughout the next week. The point of these exercises is to show some of the challenge that comes from representing information that is interesting to humanists in formats that computers can more easily digest. On Friday, we’ll also talk about the arguably more challenging task of deciding what information we want to represent!

These are the other links that were discussed today:

Finally, after today’s lightning-quick introduction, you may be interested in knowing why historian Ian Milligan thinks that JSON rocks.

Up Next: Natalie Houston and Neal Audenaert

Don’t forget that we will be having our first meeting of the semester this Friday at 5:30 in Keck Hall, Room 101 (the building with Valhalla). Dinner will be provided for everyone at the beginning.

Our guests for the workshop this Friday afternoon are Natalie Houston and Neal Audenaert. They have received a start-up grant from the NEH Office of Digital Humanities to build a program they are calling The Visual Page.

Learning Python, Part II

In my last Python post, I learned how to get a single webpage from one of my old blogs and convert it from HTML into Markdown. My objective, if you recall, is to take a list of posts from my old blog and convert them into an EPUB.

I chose this task mainly to give me a reasonable goal while learning Python, but I’m also thinking about some of the practical uses for a script like this. For instance, say you had a list of webpages containing primary source transcriptions that you wanted your students to read. A script like the one I’m trying to write could conceivably be used to package all of those sources in a single PDF or EPUB file that could then be distributed to students. The popularity of plugins like Anthologize also indicates that there is a general interest in converting blogs into electronic books, but that plugin only works with WordPress. A Python script could conceivably do this for any blog.

I’m quickly learning, however, that making this script portable will require quite a bit of tweaking. Which is another way of saying "quite a bit of geeky fun"!

Learning Python, Part I

This semester I have been trying to learn a little bit about Python, an open-source programming language. Python is the language used in the introductory undergraduate course for Computer Science here at Rice (and it’s actually the basis for Rice’s first Coursera offering). Fortunately, however, there is an even more introductory course on Python aimed at historians. It’s called The Programming Historian, and it’s what I’m using to get started with the language.

The Programming Historian offers lessons and example code that can be immediately useful. Indeed, we heard from Scott Nesbit earlier this semester that he used the lesson on Automated Downloading with Wget to download the Official Records webpages used for the Visualizing Emancipation project. The Programming Historian is also built on the assumption that the best way to learn programming is to do it. History graduate student Jason Heppler puts the point this way in his excellent essay How I Learned to Code:

How do I continue to learn? I simply dig in. Computers are best learned not though books or lecture, but by hands-on experience. … I learn and I write. I trace other’s code to see what each line of code does and how everything fits together.

I decided that if I was going to learn Python, I was going to have to "just dig in" too. So I came up with something I wanted to do, and I’ve set out to see if I can do it in Python.
