
Final Project To-Do List

I’ve put the to-do list that we made in class yesterday in the README file for our Drafts repo. For the next week, do all of your work directly on the GitHub files. Remember to "commit" your changes so you don’t lose work! I recommend drafting your work in a text editor and saving copies on your machine, just in case.

Note that if someone else changes the file while you are editing it, you should see a red banner message at the top prompting you to review the other person’s changes. Click the review changes link in that banner, and the other person’s new text should appear in your editing window alongside yours. You can then commit the file with both sets of changes incorporated.

Final Project Thoughts

Thanks to both groups for the informative presentations in class today. Now that we all know a little bit more about where each part of the project stands, we need to make some rapid decisions about what we want our final product to look like by the end of classes.


Progress Report: GeoTeam

Since our last progress report, we have completed the following tasks:

  • Clare revised the rough draft for the close reading essay. You can view the new draft at the bottom of the post.
  • Aaron revised locations_tag.py to merge location entities that appear in close proximity in an ad. For example, the raw NER results for “Sheriff of Pulaski County, Arkansas” are “Pulaski County” and “Arkansas”; the script now converts those terms into the single expression “Pulaski County, Arkansas”. This makes it easier to generate geo-coordinates for the referenced locations and trims down the number of location results. Additionally, we were getting incomplete results when the word “County” was lowercased or abbreviated, so the new version of the script pre-processes the text files to find and replace such variants. (A sketch of the merging and counting steps appears after Kaitlyn’s Palladio notes below.)
  • Aaron wrote a script, count_states.py, to convert the output of locations_tag.py into a mapping between each ad and the states referenced in that ad. It will be used to tally the number of references to every other state in our Texas, Mississippi, and Arkansas datasets.
  • Kaitlyn has been working on example maps using Google Fusion Tables. To generate state counts, she used her text editor’s Find feature to count occurrences of known state names (but not abbreviations). Once count_states.py is extended and we have more accurate numbers, we will be able to create a more accurate map.
  • Kaitlyn also test-drove Palladio. Her comments on it follow:

I was able to take a look at what Palladio has to offer for us, and I think it could be a really interesting tool because of the “point to point” mapping abilities. I quickly learned how to upload spreadsheets to Palladio and extend spreadsheets to certain variables. For example, I created a spreadsheet with columns “Year Ad Published,” “Slave Name,” “Owner Name,” “Owner Location,” “Runaway Location,” “Projected Location,” and “Permalink” and was able to link all of the location variables to a spreadsheet that contained coordinates for each place. Then, using the Palladio mapping tool, I was able to create a map that connected the Runaway Locations to the Projected Locations for each advertisement. Although I only have a few points right now, one can see how this tool could be useful for looking at how connected different places are to each other. If we want to use Palladio, we will need to start expanding the spreadsheet, which is time consuming because it requires manually inputting data. I think Palladio could be a useful tool for showing some of the outliers in our advertisement corpora.
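To make the scripted steps above concrete, here is a minimal sketch of the merging and counting ideas from the first two bullets. It is illustrative only, not the actual code of locations_tag.py or count_states.py; the entity-tuple format and the STATES set are assumptions.

    from collections import Counter

    # Merge LOCATION entities whose character spans nearly touch, so that
    # "Pulaski County" + "Arkansas" becomes "Pulaski County, Arkansas".
    # The (text, label, start, end) tuple format is an assumed NER output.
    def merge_adjacent_locations(entities, max_gap=3):
        merged = []
        for text, label, start, end in entities:
            if (label == "LOCATION" and merged and merged[-1][1] == "LOCATION"
                    and start - merged[-1][3] <= max_gap):
                prev_text, _, prev_start, _ = merged[-1]
                merged[-1] = (prev_text + ", " + text, "LOCATION", prev_start, end)
            else:
                merged.append((text, label, start, end))
        return merged

    STATES = {"Texas", "Mississippi", "Arkansas", "Louisiana"}  # assumed list

    # Tally which states the merged locations refer to, across all ads.
    def count_state_references(ad_locations):
        """ad_locations: dict mapping ad id -> list of merged location strings."""
        tally = Counter()
        for locations in ad_locations.values():
            for loc in locations:
                state = loc.rsplit(",", 1)[-1].strip()
                if state in STATES:
                    tally[state] += 1
        return tally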

Her comments on creating the fusion tables:

Using basic search functions, I have been taking the data that Aaron collected by running the ads through his tagging script and counting how many times state names are mentioned in each of the state corpora (I have been searching only for whole names right now, e.g., “Texas” and not “TX” or “TEX”). This gives me a sense of what the Google Fusion Table maps will look like with real data. The main issue I have come across is devising a scale that will work across the Texas, Arkansas, and Mississippi ads. Because Arkansas and Mississippi have so many more ads than Texas, there is no way right now to line up the scales. Depending on what our final data looks like, it might be a good idea to use percentages instead of raw counts. That way, the scale can stay consistent as you hover over different states and see each state’s data.

Example Fusion maps:
Texas: Texas Fusion Table

Arkansas: Arkansas Fusion Table

Mississippi: Mississippi Fusion Table

Next Steps
Our next steps are to continue cleaning up our locations data. We need to finish this before we can have final counts of how many times each state’s dataset references other U.S. states. To make the data comparable across states and to reduce the size of the dataset, we will be eliminating pre-1835 ads from the results.

We will be revising our rough draft to add more citations backing up its claims once we have hard numbers.

We will also decide which tool to use for creating our maps, whether Google Fusion Tables or Palladio. Both have their merits.

Rough Draft
Notes from Clare:

Over the past week, I have been going over slave advertisements from Texas and Mississippi in order to close-read them and discover trends in geographical patterns or relationships. Based on the suggestions and on reading Team 1’s rough draft, I rewrote the close reading as a more general survey, eliminating many of the specific examples and consolidating the information into about a paragraph for each state.

Rough Draft 2

Please comment on the rough draft!!

Team Assignments

As discussed yesterday in class, we are going to split up into new teams to begin working on our final web project.

Both teams will contribute three things to the final project:

  • A webpage that contains an introduction to the question, a step-by-step section discussing different methods you tried to answer the question, and a summary of findings and questions for future research.
  • At least one non-narrative visualization illustrating some of the team’s findings.
  • A brief, traditional historical essay that answers the team’s question using a close reading of the available sources. These will be combined on a single page, separate from the digital methods reports.

Your team should not only try the methods identified below to answer the question, but also reflect and report on whether these methods actually help us answer the question posed. You should also not take for granted that the corpora we have are already suitably prepared for the methods you want to try; your team may need to think through how to turn the transcriptions and metadata we’ve collected into datasets that are actually susceptible to the kinds of analysis you want to try.

Team 1

Alyssa and Daniel

Question

How similar were Texas ads to ads from the nearby states of Mississippi and Arkansas?

Method

Use text mining methods, such as word trends (in Voyant), TF-IDF, and topic modeling, to compare corpora from Texas, Mississippi and Arkansas.

Team 2

Aaron, Clare, and Kaitlyn

Question

Judging from runaway ads, how were Texas, Mississippi, Arkansas, and Louisiana connected geographically in the antebellum period?

Method

Use Named Entity Recognition to extract place names from the ad corpora and then try different methods (place count tables and graphs, Google maps, network graphs) to visualize how often places from one state were mentioned in another state’s ads.

Deadlines for Progress Reports

All progress reports from here on out will be due before class begins.

  • March 31: Report should include schedule for team’s tasks and work and initial delegation of tasks among team members
  • April 7: Report should include a draft of the "close reading" essay required for the final project, as well as update on other tasks
  • April 14 and 21: Report on progress toward final webpage

Got questions? Leave comments!

Measuring Document Similarity and Comparing Corpora

This past week, Alyssa and I have been looking at ways to quantify similarity of documents. We are doing this in the context of comparing Texas runaway slave ads to runaway slave ads from other states. Thanks to the meticulous work of Dr. Max Grivno and Dr. Douglas Chambers in the Documenting Runaway Slaves project at the Southern Miss Department of History, we have at our disposal a sizable set of transcribed runaway slave ads from Arkansas and Mississippi that we will be able to experiment with. Since the transcriptions are not in the individual-document format needed to measure similarity, Franco will be using regex to split those corpora into their component advertisements.
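As a rough illustration of that splitting step, here is a sketch only: the header pattern below is a guess at the transcription layout, not the regex Franco will actually use.

    import re

    # Assumed format: each transcribed ad begins with a citation line such
    # as "ARKANSAS GAZETTE, July 4, 1838" (this pattern is hypothetical).
    HEADER = re.compile(r"^[A-Z][A-Z .]+, \w+ \d{1,2}, \d{4}\s*$", re.M)

    def split_corpus(text):
        """Split one big transcription file into individual ad strings."""
        starts = [m.start() for m in HEADER.finditer(text)]
        return [text[a:b].strip()
                for a, b in zip(starts, starts[1:] + [len(text)])]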

The common method for measuring document similarity is to take the cosine similarity of TF-IDF (term frequency–inverse document frequency) scores for the words in each pair of documents. You can read more about how it works and how to implement it in this post by Jana Vembunarayanan at the blog Seeking Similarity. Essentially, a term frequency value for each token (unique word) in a document is obtained by counting the occurrences of that word within the document; those values are then weighted by the inverse document frequency (IDF). The IDF is the log of the ratio of the total number of documents to the number of documents containing that word. Multiplying the term frequency by the inverse document frequency thus weights the term by how rare it is in the rest of the corpus. Words that occur frequently in a specific document but rarely in the rest of the corpus achieve high TF-IDF scores, while words that occur frequently throughout the corpus achieve low TF-IDF scores.
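In practice this takes only a few lines with a library like scikit-learn. A minimal sketch, assuming the ads have already been split into one string per document (the sample strings are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["RANAWAY from the subscriber, a negro man named Jim ...",
            "Fifty dollars reward ..."]  # one string per ad

    vectorizer = TfidfVectorizer()          # tokenizes, counts, applies IDF
    tfidf = vectorizer.fit_transform(docs)  # rows = documents, cols = terms
    sims = cosine_similarity(tfidf)         # sims[i, j] = similarity of docs i, j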

Using cosine similarity with TF-IDF seems to be the accepted way to compute pairwise document similarity, and so as not to reinvent the wheel, we will probably use that method. That said, some creativity is needed to compare corpora as a whole, rather than just two documents. For example, which pair of corpora is most similar: Texas’s and Arkansas’s, Arkansas’s and Mississippi’s, or Texas’s and Mississippi’s? We could compute the average similarity over all pairs of documents drawn from each pair of corpora.
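For instance, a sketch of that averaging idea, assuming both corpora were transformed with the same fitted TfidfVectorizer so their columns line up:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def corpus_similarity(tfidf_a, tfidf_b):
        """Mean cosine similarity over all cross-corpus document pairs."""
        return float(np.mean(cosine_similarity(tfidf_a, tfidf_b)))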

Just as a side note, if we solve the problem of automatically transcribing individual Texas runaway ads, we could use TF-IDF and cosine similarity to locate duplicate ads. Runaway slave ads were often posted multiple times in a newspaper, sometimes with minor differences between printings (for example, in the reward amount). We could classify pairs of documents with a cosine similarity score above a specified threshold as duplicates.
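A sketch of that duplicate check, using the similarity matrix from the snippet above (the 0.9 threshold is a guess that would need tuning against known duplicates):

    import numpy as np

    def find_duplicates(sims, threshold=0.9):
        """Return index pairs of ads whose cosine similarity exceeds threshold."""
        i, j = np.where(np.triu(sims, k=1) > threshold)  # upper triangle only
        return list(zip(i.tolist(), j.tolist()))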

We could also use Named Entity Recognition to measure the similarity of corpora in terms of place-connectedness. Named Entity Recognition is a tool for discovering and labeling words as places, names, companies, etc. Names might not be too helpful since, as far as I have been able to tell, slaves were usually identified by a first name only, but it would be interesting to see which corpora reference locations in other states. For example, there might be a runaway slave ad in the Telegraph and Texas Register in which a slave was thought to be heading northeast towards Little Rock, where he or she had family. The Arkansas corpus would undoubtedly have many ads with the term Little Rock. If a significant number of ads in Texas mentioned Arkansas places, or vice versa, that is information we would want to capture to measure how connected the Texas and Arkansas corpora are.

[Image: demo run of Stanford’s Named Entity Tagger on an Arkansas runaway slave ad]

A simple way to quantify this measure of place-connectedness would start with the Named Entity Recognition output: a list of tokens and the type of named entity each one is (if any). We would then iterate through the tokens and, whenever a token represents a location in another state (perhaps the Google Maps API could be used to resolve place names to states?), increment the place-connectedness score for that pair of states.
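A sketch of that scoring loop; the gazetteer dictionary here is a stand-in for whatever place-to-state lookup we end up using (the Google Maps API, for example):

    from collections import Counter

    # Hypothetical place-name-to-state lookup; in practice this could be
    # built from a gazetteer or a geocoding API.
    PLACE_TO_STATE = {"Little Rock": "Arkansas", "Natchez": "Mississippi",
                      "Houston": "Texas"}

    def place_connectedness(tagged_corpora):
        """tagged_corpora: dict of corpus state -> list of (token, label) pairs."""
        scores = Counter()
        for ad_state, tokens in tagged_corpora.items():
            for token, label in tokens:
                if label == "LOCATION":
                    ref_state = PLACE_TO_STATE.get(token)
                    if ref_state and ref_state != ad_state:
                        scores[(ad_state, ref_state)] += 1
        return scores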

We also explored other tools for comparing text documents. In class we have already looked at Voyant Tools, and now we have been looking at other publicly available tools that can compare documents side by side. TAPoR is a useful resource that lets you browse and discover a huge collection of text analysis tools from around the web. It contains tools for comparing documents as well as for other kinds of text analysis. As we move forward with our project, TAPoR could be a great resource for finding and experimenting with tools to apply to our collection of runaway slave ads.

TAPoR provides a tool from TAPoRware called Comparator that analyzes two documents side by side to compare word counts and word ratios. We tested this tool on the Arkansas and Mississippi runaway advertisement collections. This sample comparison already yields interesting results, and gives an idea of how we could use word ratios to raise questions about runaway slave patterns across states.

These screenshots show a test run of the ads through the TAPoR Comparator; the Arkansas ads are Text 1 and the Mississippi ads are Text 2. The comparison reveals that the words “Cherokee” and “Indians” have a high relative frequency in the Arkansas corpus, perhaps suggesting a higher rate of interaction between runaway slaves and Native Americans in Arkansas than in Mississippi. Clicking on a word of interest shows a snippet of the word in context. Looking into the full text of ads containing the word “Cherokee”, we find descriptions of slaves running away to live in the Cherokee Nation or in the company of Native Americans, slaves who were part Cherokee and could speak the language, and even one slave formerly owned by a Cherokee.

However, after digging into the word ratios a little deeper, it turns out that uses of the words “Choctaw” and “Indian” are about even for Arkansas and Mississippi, so in the end the two states may have similar patterns of runaway interaction with Native Americans. Nevertheless, this test of the Comparator gives us an idea of the sorts of questions it could help raise and answer when comparing advertisements. For example, many of us were curious whether Texas runaway slaves ran away to Mexico or ran away with Mexicans. We could use this tool to compare the ratios of the words “Mexico” or “Mexican” in Texas against other states.

Progress Report #1 Tasks

As indicated on the syllabus, your first Progress Report on our class project is due this Monday, March 17, by the end of class. The progress report should take the form of a correctly formatted, hyperlink-rich post to this blog. Each group needs to make only one post, but you should work together on the post and will be assigned a grade on the report as a group. Note that the report needs to show your progress, even if you haven’t yet completed all the tasks assigned to you. The groups/tasks we assigned last Monday are as follows, but keep in mind that groups and tasks will shift as we move forward.


Slides from Tool Presentations

Thanks for the great job that you all did on your presentations about digital tools that might be helpful for our project with runaway slave ads! I’m posting here the slides that were shown in class so that we can reference them. Click the image to get the whole PDF.

First, Alyssa and Daniel talked with us about Voyant Tools:

Clare and Kaitlyn talked about using Google Maps and Google Fusion Tables, together with Social Explorer:

Thanks for sharing!

Group Presentation Schedule

  • Monday, 2pm: Alyssa and Daniel
  • Monday, 2:25pm: Aaron (and Franco)
  • Wednesday, 2pm: Clare and Kaitlyn

Discovering Runaway Slave Ads

These last few days, Franco and I have been developing a way to detect runaway slave ads in images of 19th-century newspapers. The Portal to Texas History has digitized thousands of issues of Texas newspapers, a source waiting to be explored for runaway slave ads. For example, a search for “runaway negro” in the full text (OCR transcriptions) of their collection yields 7,159(!) results. Clearly, that number is too high for manual perusal of all possible matches.

Thus, we have been thinking about ways to automate the process. At the suggestion of Dr. McDaniel, we decided to use OpenCV, a popular open-source computer vision library, to do object recognition for the classic runaway slave icon. You know, this one:

[Image: fugitive slave icon]

(In newspapers, from what I have seen, it usually appeared much smaller and simplified, as shown here).

OpenCV has a tool called Cascade Classifier Training that builds an XML file that can be used to detect objects. It requires a set of positive samples (images that contain the chosen object) and negative samples (images that do not contain the object but are of similar context). It works best with a large dataset of positive samples, and to help generate one it provides a utility called opencv_createsamples that takes an image and applies transformations to it, such as intensity adjustments, rotations, and color inversions, to make altered versions. Once the cascade has been trained, it can be used to efficiently detect and locate the desired object in other images.

So, the first order of business in preparing to do object recognition was to collect a set of runaway slave icons. I downloaded ~35 newspaper page images containing the icon and cropped them so that only the icon was visible. The tutorials I read [1, 2, 3, and others] suggested that for best results the positive images (images of the object to be detected) should all have the same aspect ratio. For simplicity, I made sure all my images were 60×64 px.

Next I generated a set of negative (background) images from newspaper pages that did not have the runaway icon. These had to be the same size as the positive images. I read that a large dataset was especially needed for the negatives, so I wrote a simple script to crop newspaper page images into a series of individual 60×64 pics. For anyone curious, here’s a gist of the code, and a rough sketch of the idea follows below. A typical image looked something like this:

[Image: sample background image]
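The actual code is in the gist linked above; the following is just a sketch of the cropping idea, using OpenCV to slice a page into 60×64 tiles:

    import cv2

    def crop_page(path, width=60, height=64):
        """Slice a newspaper page image into width x height negative samples."""
        page = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        rows, cols = page.shape
        n = 0
        for y in range(0, rows - height + 1, height):
            for x in range(0, cols - width + 1, width):
                cv2.imwrite(f"neg_{n:04d}.png", page[y:y + height, x:x + width])
                n += 1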

[Image: negative sample for training the Haar cascade]

After running the script on several images, I ended up with ~1,600 negative images to use in training the cascade classifier. I supplemented those with some manually cropped pics of common icons, such as the one shown above.

Next I used the find command in Terminal to output text files listing all the positive and all the negative images. Then I created the “sample,” a binary file containing all the positive images, which is required by the cascade trainer (opencv_traincascade). As I mentioned, transformation settings are usually specified when creating the sample in order to multiply the amount of data available to train the cascade. I figured that the runaway icon would always appear upright, and I had made sure my positive image set contained icons of varying clarity, so I just ran opencv_createsamples without any distortions.

Finally, I had all I needed to train the cascade. I ran the following command in Terminal:
opencv_traincascade -data classifier -vec samples/samples.vec -bg negatives.txt -numStages 6 -minHitRate 0.95 -numPos 27 -numNeg 1613 -w 60 -h 64 -precalcValBufSize 512 -precalcIdxBufSize 256

opencv_traincascade is the program to be run. The value for -data is the name of the folder in which to store the resulting cascade file. The value for -vec is the path to the samples vector file. The value for -bg is the name of the file listing the paths to the negative images. I am not entirely sure about -numStages, so I just picked 6, since I didn’t want the training to run for days as others have experienced. -minHitRate dictates the accuracy. I still don’t quite understand -numPos, but I chose ~80% of the number of positive images to ensure no errors would result. -numNeg is the number of negative images. Then there’s the width, the height, and some settings specifying how much RAM the program can hog.

I had high hopes, but after 30 minutes of fans-blaring CPU use the program quit with the error, “Required leaf false alarm rate achieved. Branch training terminated.” I need to do more research to figure out why it didn’t work, but an initial search suggested that the number of positive samples I used may not have been enough. Joy..

Next Steps:

  • Play around with OpenCV some more to try to get a functional cascade. Maybe enlist the help of stackoverflow or reddit.
  • Rethink whether object recognition is the best way to maximize runaway slave ad discovery. While a lot of ads did use the icon, perhaps a larger number did not. For newspapers with digital transcriptions, text-based analysis would surely provide better results.
  • If we can’t get a working cascade to do object recognition, revisit newspaper decomposition. Franco and I tried using Hough Line Transforms in OpenCV to detect the lines separating newspaper articles, but to no avail. Its promise is marked-up images like the Sudoku board shown below; to the right of it is our “success.” The theory is that if we could detect the dividing lines in newspapers, we could crop the pages into individual articles, run OCR on each article, and then do text analysis to discover runaway ads. It is no easy feat, though, as these research articles [1, 2] demonstrate.
  • I was able to improve our results by limiting the detected lines to those with approximately horizontal or vertical slopes, since those are the only ones we are interested in for newspapers, but it is clear we need to tweak the script or find a better system. (See the sketch after the images below.)

    [Image: marked-up Sudoku board produced by the Hough Line Transform]

    [Image: our Hough Line Transform output on a newspaper page; the best we can do so far]
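For reference, here is a sketch of the line-filtering experiment described above; the parameter values are guesses, and the real script would need tuning.

    import cv2
    import numpy as np

    def find_dividers(path, tol_deg=2.0):
        """Detect long straight lines, keeping near-horizontal/vertical ones."""
        page = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        edges = cv2.Canny(page, 50, 150)
        lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                                minLineLength=200, maxLineGap=10)
        keep = []
        if lines is not None:
            for x1, y1, x2, y2 in lines[:, 0]:
                angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1))) % 180
                # keep lines within tol_deg of horizontal (0/180) or vertical (90)
                if min(angle, 180 - angle) < tol_deg or abs(angle - 90) < tol_deg:
                    keep.append((x1, y1, x2, y2))
        return keep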

    If you have any tips or feedback, feel free to contact Franco (@FrancoBettati31) or me (@brawnstein) on Twitter, or leave a comment below. Thanks!

Group Presentations

In the next two weeks of class, we will divide our labor so that we can learn about some different kinds of digital tools that might help us answer (or more effectively present our answers to) our questions about slavery and runaway slave ads in Texas.

You will pair up with a partner to work through some tutorials (much like you did for Homework #3), and then discuss with your partner how this tool (or others like it) might be useful for our class. Your final task will be to report back to the class with an oral presentation that gives your classmates a sense of what the tool can do and what it might do for us.
