Progress Report: GeoTeam

Since our last progress report, we have completed the following tasks:

  • Clare revised the rough draft for the close reading essay. You can view the new draft at the bottom of the post.
  • Aaron revised his script to merge location entities that appear in close proximity in an ad. For example, the raw NER results for “Sheriff of Pulaski County, Arkansas” are “Pulaski County” and “Arkansas”; the script now converts those terms into a single expression, “Pulaski County, Arkansas”. This makes it easier to generate geo-coordinates for the referenced locations and trims down the number of location results. Additionally, we were having problems with incomplete results when the word “County” was lowercase or abbreviated, so the new version of the script pre-processes the text files to find and replace such words.
  • Aaron wrote a script to convert the NER output into a mapping between each ad and the states referenced in that ad. It will be used to tally the number of references to each other state in our Texas, Mississippi, and Arkansas datasets.
  • Kaitlyn has been working on example maps using Google Fusion Tables. To generate state counts, she used the Find feature of her text editor to count the number of occurrences of known state names (but not initials). Once the script's functionality is extended and we have more accurate numbers, we will be able to create a more accurate map.
  • Kaitlyn also test-drove Palladio. The following are her comments on it:

I was able to take a look at what Palladio has to offer for us, and I think it could be a really interesting tool because of the “point to point” mapping abilities. I quickly learned how to upload spreadsheets to Palladio and extend spreadsheets to certain variables. For example, I created a spreadsheet with columns “Year Ad Published,” “Slave Name,” “Owner Name,” “Owner Location,” “Runaway Location,” “Projected Location,” and “Permalink” and was able to link all of the location variables to a spreadsheet that contained coordinates for each place. Then, using the Palladio mapping tool, I was able to create a map that connected the Runaway Locations to the Projected Locations for each advertisement. Although I only have a few points right now, one can see how this tool could be useful for looking at how connected different places are to each other. If we want to use Palladio, we will need to start expanding the spreadsheet, which is time consuming because it requires manually inputting data. I think Palladio could be a useful tool for showing some of the outliers in our advertisement corpora.

Her comments on creating the fusion tables:

Using basic search functions, I have been taking the data that Aaron collected by running the ads through his tagging script and counting how many times state names are mentioned in each of the state corpora (I have been searching only for whole names right now; e.g., “Texas” and not “TX” or “TEX”). This enables me to get a sense of what the Google Fusion Table maps will look like with real data. The main issue that I have come across in doing this is coming up with a scale that will work across the Texas, Arkansas, and Mississippi ads. Because Arkansas and Mississippi have so many more ads than Texas, there is no way right now to line up the scales. Depending on what our final data looks like, it might be a good idea to use percentages instead of raw counts. That way, the scale can stay consistent as you hover over different states and see each state’s data.
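The whole-name search described above can also be scripted. A minimal sketch (the sample text and the state list here are made up for illustration):

```python
import re

def count_state_mentions(text, states):
    # Count whole-word occurrences only ("Texas", not "TX" or "TEX"),
    # mirroring the manual Find-based counts described above.
    return {s: len(re.findall(r"\b" + re.escape(s) + r"\b", text)) for s in states}

counts = count_state_mentions(
    "Ranaway from Texas; thought to be heading to Arkansas. Texas reward offered.",
    ["Texas", "Arkansas", "Mississippi"],
)
```

Converting these raw counts to percentages of each corpus would address the scaling issue she mentions.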

Example Fusion maps:
Texas: Texas Fusion Table

Arkansas: Arkansas Fusion Table

Mississippi: Mississippi Fusion Table

Next Steps
Our next steps are to continue cleaning up our locations data. We need to finish this before we can have final counts of how many times the ads in each state's data set referenced other U.S. states. To make the data comparable across states and reduce the size of the data set, we will be eliminating pre-1835 ads from the results.

We will be revising our rough draft to add more citations to back up its claims once we have hard numbers.

We will decide what tool we will use for creating our maps, whether that be Google Fusion Tables or Palladio. Both have their merits.

Rough Draft
Notes from Clare:

Over the past week, I have been going over slave advertisements from Texas and Mississippi in order to close-read and discover trends in geographical patterns or relationships. Based on the suggestions and on reading Team 1’s rough draft, I re-wrote the close-reading as a more general survey, eliminating many of the specific examples and consolidating information into about a paragraph for each state.

Rough Draft 2

Please comment on the rough draft!!

Update on TAPoR

After completing last week’s progress report, one of the questions we were left with was how the TAPoR Comparator calculates relative ratio. The documentation page does not specify where the relative count or the relative ratio comes from, but a few trial calculations were able to lead us down the right path. We tested out numbers for “negro,” the most frequently occurring word in the Arkansas document from the Documenting Runaway Slaves project.

The results? The relative count equals the word count divided by the total number of words: in this case, 920/80,690 for Arkansas and 2,688/235,602 for Mississippi. Next, the relative ratio equals the Text 1 relative count divided by the Text 2 relative count, 0.0114/0.0114. Words that are relatively more frequent in Text 1 (AR) have a relative ratio higher than 1, words that are relatively more frequent in Text 2 (MS) have a relative ratio lower than 1, and words that are relatively equal have a ratio of 1. The relative ratio thus adjusts for document length and raw word counts to compare relative word frequencies. For example, even though “negro” has more than double the word count in Mississippi, the relative count for both AR and MS is ~0.0114. This places the relative ratio at 0.9994, almost 1. (The reason this value is not exactly 1 is that the displayed relative counts are rounded off after the 4th decimal place; the relative counts for AR and MS are not actually identical down to the last decimal place.)
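The arithmetic is easy to check with the counts quoted above:

```python
# Word counts for "negro" and total word counts, as reported above
ar_count, ar_total = 920, 80_690
ms_count, ms_total = 2_688, 235_602

rel_ar = ar_count / ar_total      # relative count for Text 1 (AR), ~0.0114
rel_ms = ms_count / ms_total      # relative count for Text 2 (MS), ~0.0114
relative_ratio = rel_ar / rel_ms  # ~0.9994, i.e. almost 1
```

Carrying the full precision through (rather than the displayed 4-decimal values) reproduces the 0.9994 that Comparator reports.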

So, Comparator balances the differences in document length between AR and MS to reveal that relatively, advertisements from the two states use the word “negro” with practically equal frequency. This sort of comparison could be useful for determining how language used to refer to the race of slaves does (or doesn’t) change across states. Similarly to TF-IDF, Comparator attempts to adjust for term frequency across documents to locate words that are more commonly occurring in one document compared to the rest of the corpus.

Now that we know how they both work, it would be interesting to compare our documents using both TAPoR’s Comparator and TF-IDF to see how the results differ. Here are the results for the word “negro” in Voyant’s TF-IDF option, recently added by Stefan Sinclair.
Again, AR and MS have very similar TF-IDF scores for the word “negro” despite MS’s raw word count being much higher.

You can view the raw word comparison output from TAPoR comparator at this webpage. You can also view the raw output from Voyant tools at this webpage.

Measuring Document Similarity and Comparing Corpora

This past week, Alyssa and I have been looking at ways to quantify similarity of documents. We are doing this in the context of comparing Texas runaway slave ads to runaway slave ads from other states. Thanks to the meticulous work of Dr. Max Grivno and Dr. Douglas Chambers in the Documenting Runaway Slaves project at the Southern Miss Department of History, we have at our disposal a sizable set of transcribed runaway slave ads from Arkansas and Mississippi that we will be able to experiment with. Since the transcriptions are not in the individual-document format needed to measure similarity, Franco will be using regex to split those corpora into their component advertisements.

The common method of measuring document similarity is to take the cosine similarity of TF-IDF (term frequency–inverse document frequency) scores for the words in each pair of documents. You can read more about how it works and how to implement it in this post by Jana Vembunarayanan at the blog Seeking Similarity. Essentially, term frequency values for each token (unique word) in a document are obtained by counting the occurrences of that word within the document; those values are then weighted by the inverse document frequency (IDF). The IDF is the log of the ratio of the total number of documents to the number of documents containing that word. Multiplying the term frequency by the inverse document frequency thus weights the term by how common it is in the rest of the corpus. Words that occur frequently in a specific document but rarely in the rest of the corpus achieve high TF-IDF scores, while words that occur frequently in a specific document but also commonly in the rest of the corpus achieve low TF-IDF scores.
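As a concrete sketch of the method (a minimal pure-Python version, not necessarily the implementation we will end up using; the sample ads are invented):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf-idf score} dict per doc."""
    n = len(docs)
    df = Counter()                   # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)            # raw term frequency within this doc
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine_similarity(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values())) *
            math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

docs = [
    ["ranaway", "negro", "reward", "texas"],
    ["ranaway", "negro", "jail", "arkansas"],
    ["stolen", "horse", "reward", "texas"],
]
vecs = tfidf_vectors(docs)
```

Each document becomes a vector of TF-IDF weights, and the cosine of the angle between two vectors (1 for identical direction, 0 for no shared weighted terms) serves as the similarity score.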

Using cosine similarity with TF-IDF seems to be the accepted way to compute pairwise document similarity, and so as not to reinvent the wheel, we will probably use that method. That said, some creativity is needed to compare corpora as a whole, rather than just two documents. For example, which corpora are most similar: Texas’s and Arkansas’s, Arkansas’s and Mississippi’s, or Texas’s and Mississippi’s? We could compute the average similarity over all pairs of documents drawn from each pair of corpora.

Just as a side-note, if we solve the problem of automatically transcribing individual Texas runaway ads, we could use cosine similarity and TF-IDF to locate duplicate ads. Runaway slave ads were often posted multiple times in a newspaper, sometimes with minor differences between each printing of the advertisement (for example, in reward amount). We could classify pairs of documents with a cosine similarity score greater than a specified threshold as duplicates.

We could also use Named Entity Recognition to measure the similarity of corpora in terms of place-connectedness. Named Entity Recognition is a tool to discover and label words as places, names, companies, etc. Names might not be too helpful since, as far as I have been able to tell, slaves were usually identified just by a first name, but it would be interesting to see which corpora reference locations corresponding to another state. For example, there might be a runaway slave ad listed in the Telegraph and Texas Register in which a slave was thought to be heading northeast towards Little Rock, where he/she has family. The Arkansas corpus would undoubtedly have many ads with the term Little Rock. If there were a significant number of ads in Texas mentioning Arkansas places, or vice-versa, this is information we would want to capture to measure how connected the Texas and Arkansas corpora are.

Demo run of Stanford's Named Entity Tagger on an Arkansas runaway slave ad

A simple way we could quantify this measure of place-connectedness would start with a Named Entity Recognition list of tokens and what type of named entity they are (if any). Then we would iterate through all tokens and, if the token represents a location in another state in the corpus (perhaps the Google Maps API could be used?), increment the place-connectedness score for that pair of states.
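The tallying step described above could look something like the following sketch. The gazetteer, the data shapes, and all names here are hypothetical; in practice the place-to-state lookup might come from the Google Maps API mentioned above:

```python
from collections import Counter

# Hypothetical gazetteer mapping place names to states
GAZETTEER = {"Little Rock": "AR", "Natchez": "MS", "Houston": "TX"}

def connectedness_scores(ads_by_state):
    """ads_by_state maps a corpus's home state to its ads, each ad being a
    list of (token, entity_type) pairs from the Named Entity Recognition run."""
    scores = Counter()
    for home_state, ads in ads_by_state.items():
        for ad in ads:
            for token, entity_type in ad:
                if entity_type == "LOCATION":
                    state = GAZETTEER.get(token)
                    if state and state != home_state:
                        # A reference to a place in another state:
                        # increment that pair's connectedness score.
                        scores[(home_state, state)] += 1
    return scores

scores = connectedness_scores({
    "TX": [[("Little Rock", "LOCATION"), ("John", "PERSON")],
           [("Natchez", "LOCATION")]],
    "AR": [[("Little Rock", "LOCATION")]],  # in-state reference, not counted
})
```

In-state references and non-location entities are skipped, so the resulting counter only tallies cross-state mentions.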

We also explored other tools that can be used to compare text documents. In class, we have already looked at Voyant Tools, and now we have been looking at other publicly available tools that can be used to compare documents side by side. TAPoR is a useful resource that lets you browse and discover a huge collection of text analysis tools from around the web. It contains tools for comparing documents as well as for other kinds of text analysis. As we move forward with our project, TAPoR could definitely be a great resource for finding and experimenting with different tools that can be applied to our collection of runaway slave ads.

TAPoR provides a tool from TAPoRware called Comparator that analyzes two documents side by side to compare word counts and word ratios. We tested this tool on the Arkansas and Mississippi runaway advertisement collections. This sample comparison already yields interesting results, and gives an idea of how we could use word ratios to raise questions about runaway slave patterns across states.

These screenshots show a test run of the ads through the TAPoR Comparator; the Arkansas ads are Text 1 and the Mississippi ads are Text 2. The comparison reveals that the words “Cherokee” and “Indians” have a high relative frequency in the Arkansas corpus, perhaps suggesting a higher rate of interaction between runaway slaves and Native Americans in Arkansas than in Mississippi. Clicking on a word of interest gives a snippet of the word in context. Looking into the full text of ads containing the word “Cherokee”, we find descriptions of slaves running away to live in the Cherokee Nation or in the company of Native Americans, slaves who were part Cherokee and could speak the language, and even one slave formerly owned by a Cherokee.

However, after digging into the word ratios a little deeper, it turns out that uses of the words “Choctaw” and “Indian” are about even for Arkansas and Mississippi, so in the end the two states may have similar patterns of runaway interaction with Native Americans. Nevertheless, this test of the Comparator gives us an idea of the sorts of questions it could help raise and answer when comparing advertisements. For example, many of us were curious whether Texas runaway slaves ran away to Mexico or ran away with Mexicans. We could use this tool to compare the ratios of the words “Mexico” or “Mexican” in Texas against other states.

Discovering Runaway Slave Ads

These last few days, Franco and I have been developing a way to detect runaway slave ads in images of 19th-century newspapers. The Portal to Texas History has digitized thousands of issues of Texas newspapers and is a source waiting to be explored for runaway slave ads. For example, a search for “runaway negro” in the full text (OCR transcriptions) of their collection yields 7,159(!) results. Clearly, that number is too high to accommodate manual perusal of all possible matches.

Thus, we have been thinking about ways to automate the process. At the suggestion of Dr. McDaniel, we decided to use OpenCV, a popular open-source computer vision library, to conduct object recognition for the classic runaway slave icon. You know, this one:

(In newspapers, from what I have seen, it usually appeared much smaller and simplified, as shown here).

OpenCV has a tool called Cascade Classifier Training that builds an XML file which can be used to detect objects. It requires a set of positive samples (images that contain the chosen object) and negative samples (images that do not contain the object but are of similar context). It works best with a large dataset of positive samples; to generate one, OpenCV provides a function called “createsamples” that takes an image and applies transformations to it, such as adjustments in intensity, rotations, color inversions, and more, to produce altered versions. Once the cascade has been trained, it can be used to efficiently detect and locate the desired object in other images.

So, the first order of business in preparing to do object recognition was to collect a set of runaway slave icons. I downloaded ~35 newspaper page images containing the icon and cropped them down to just the icon. The tutorials [1, 2, 3 ..others] I read suggested that for best results the positive images (images of the object to be detected) should all have the same aspect ratio. For simplicity, I made sure all my images were 60×64px.

Next, I generated a set of negative (background) images taken from newspaper pages that did not have the runaway icon. These had to be the same size as the positive images. I read that a large data set was especially needed for the negatives, so I wrote a simple script to crop newspaper page images into a series of individual 60×64 pics. For anyone curious, here’s a gist of the code. A typical image looked something like this.
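The tiling logic behind such a script is straightforward; a sketch (the actual code is in the gist linked above, and the function name here is our own):

```python
def crop_boxes(page_w, page_h, tile_w=60, tile_h=64):
    """Return (left, upper, right, lower) boxes tiling a page image into
    tile_w x tile_h crops; partial tiles at the edges are dropped."""
    return [(left, top, left + tile_w, top + tile_h)
            for top in range(0, page_h - tile_h + 1, tile_h)
            for left in range(0, page_w - tile_w + 1, tile_w)]
```

With Pillow, each box can then be passed to `Image.crop(box)` and saved as its own negative sample.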

After running the script on several images, I ended up with ~1600 negative images to use in training the cascade classifier. I supplemented those with some manually cropped pics of common icons, such as the one that appears to the left.

Next, I used the find command in Terminal to output text files containing lists of all the positive and all the negative images. Then I created the “sample,” a binary file containing all the positive images, which is required by the cascade trainer (opencv_traincascade). As I mentioned, transformation settings are usually specified when creating the sample to multiply the amount of data available for training the cascade. I figured that the runaway icon would always appear upright, and I made sure my positive image set contained icons of varying clarity, so I just ran opencv_createsamples without any distortions.

Finally, I had all I needed to train the cascade. I ran the following command in Terminal:
opencv_traincascade -data classifier -vec samples/samples.vec -bg negatives.txt -numStages 6 -minHitRate 0.95 -numPos 27 -numNeg 1613 -w 60 -h 64 -precalcValBufSize 512 -precalcIdxBufSize 256

Opencv_traincascade is the program to be run. The value for data is the name of the folder in which to store the resulting cascade file. The value for vec is the path to the samples vector file. The value for bg is the name of the file containing paths to each negative image. numStages I am not entirely sure about, so I just picked 6, since I didn’t want the training to run for days as others have experienced. minHitRate dictates the accuracy. numPos I still don’t quite understand, but I chose ~80% of the number of positive images to ensure no errors would result. numNeg is the number of negative images. Then there’s width, height, and some settings specifying how much RAM the program can hog.

I had high hopes, but after 30 minutes of fans-blaring CPU use the program quit with the error, “Required leaf false alarm rate achieved. Branch training terminated.” I need to do more research to figure out why it didn’t work, but an initial search told me that the number of positive samples I used may not be enough. Joy..

Next Steps:

  • Play around with OpenCV some more to try to get a functional cascade. Maybe enlist the help of stackoverflow or reddit.
  • Rethink whether object recognition is the best way to maximize runaway slave ad discovery. While a lot of ads did use the icon, perhaps a larger number did not. For newspapers with digital transcriptions, text-based analysis would surely provide better results.
  • If we can’t get a working cascade to do object recognition, revisit newspaper decomposition. Franco and I tried using Hough Line Transforms through OpenCV to detect the lines separating newspaper articles, but to no avail. When it works, it promises marked-up images like the Sudoku board shown below; to the right of it is our “success.” The theory is that if we could detect the dividing lines in newspapers, we could crop the pages into individual articles, run OCR on each article, and then do text analysis to discover runaway ads. It is no easy feat, though, as these [1, 2] research articles demonstrate.
  • I was able to improve our results by limiting the detected lines to those with approximately horizontal or vertical slopes, since those are the only ones of interest in newspapers, but it is clear we need to tweak the script or enlist a better system.

    Marked up Sudoku board using Hough Line Transform

    Sudoku hough line transform

    Hough Line Transform output

    Best we can do so far..
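The slope filtering mentioned in the next steps can be done directly on the (rho, theta) pairs that cv2.HoughLines returns, where theta near 0 (or π) marks a vertical line and theta near π/2 a horizontal one. A minimal sketch (the 5° tolerance is our own choice):

```python
import math

def keep_axis_aligned(lines, tol=math.radians(5)):
    """Keep only near-vertical (theta ~ 0 or ~ pi) and near-horizontal
    (theta ~ pi/2) lines from (rho, theta) pairs like cv2.HoughLines yields."""
    return [(rho, theta) for rho, theta in lines
            if min(theta, math.pi - theta) < tol or abs(theta - math.pi / 2) < tol]

# A diagonal line (theta = pi/4) gets dropped; the axis-aligned ones survive.
lines = [(100.0, 0.0), (50.0, math.pi / 2), (30.0, math.pi / 4)]
filtered = keep_axis_aligned(lines)
```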

    If you have any tips or feedback, feel free to contact Franco (@FrancoBettati31) or me (@brawnstein) on Twitter, or leave a comment below. Thanks!