Category Archives: Voyant

Measuring Document Similarity and Comparing Corpora

This past week, Alyssa and I have been looking at ways to quantify similarity of documents. We are doing this in the context of comparing Texas runaway slave ads to runaway slave ads from other states. Thanks to the meticulous work of Dr. Max Grivno and Dr. Douglas Chambers in the Documenting Runaway Slaves project at the Southern Miss Department of History, we have at our disposal a sizable set of transcribed runaway slave ads from Arkansas and Mississippi that we will be able to experiment with. Since the transcriptions are not in the individual-document format needed to measure similarity, Franco will be using regex to split those corpora into their component advertisements.
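To give a sense of what that splitting step might look like, here is a minimal Python sketch. The header pattern and file name are purely illustrative assumptions; the actual Documenting Runaway Slaves transcriptions may mark the start of each ad differently, so the regex would need to be adapted to whatever convention the PDFs actually use.

```python
import re

# Hypothetical delimiter: assume each transcribed ad begins with a citation
# line ending in a date, e.g. "Arkansas Gazette, January 3, 1838".
# The real transcriptions may use a different convention.
AD_HEADER = re.compile(
    r"^.*(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+18\d{2}.*$",
    re.MULTILINE,
)

def split_into_ads(corpus_text):
    """Split a full transcription into individual ads at each header line."""
    starts = [m.start() for m in AD_HEADER.finditer(corpus_text)]
    return [
        corpus_text[start:end].strip()
        for start, end in zip(starts, starts[1:] + [len(corpus_text)])
    ]

# "arkansas_ads.txt" is a placeholder for the plain-text transcription.
with open("arkansas_ads.txt", encoding="utf-8") as f:
    ads = split_into_ads(f.read())
print(len(ads), "ads found")
```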

The common method for measuring document similarity is to take the cosine similarity of TF-IDF (term frequency–inverse document frequency) scores for the words in each pair of documents. You can read more about how it works and how to implement it in this post by Jana Vembunarayanan at the blog Seeking Similarity. Essentially, a term frequency value for each token (unique word) in a document is obtained by counting the occurrences of that word within the document; those values are then weighted by the inverse document frequency (IDF), the log of the ratio of the total number of documents to the number of documents containing that word. Multiplying the term frequency by the inverse document frequency thus weights the term by how common it is in the rest of the corpus: words that occur frequently in a specific document but rarely in the rest of the corpus achieve high TF-IDF scores, while words that occur frequently in a specific document but also commonly in the rest of the corpus achieve low TF-IDF scores.
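To make the computation concrete, here is a small sketch using scikit-learn, one common Python library that implements both TF-IDF and cosine similarity (its IDF formula adds some smoothing, but the idea is the same as above). The toy documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents standing in for individual runaway ads.
docs = [
    "ranaway from the subscriber a negro man named Jim",
    "ranaway a negro woman named Harriet, ten dollars reward",
    "committed to the jail of Pulaski county a runaway negro man",
]

# TfidfVectorizer computes term frequencies and weights them by
# inverse document frequency in one step.
tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine similarity between every pair of documents (1.0 on the diagonal).
print(cosine_similarity(tfidf).round(3))
```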

Using cosine similarity with TF-IDF seems to be the accepted way to compute pairwise document similarity, and so as not to reinvent the wheel, we will probably use that method. That said, some creativity is needed to compare corpora as a whole, rather than just two documents. For example, which corpora are most similar: Texas’s and Arkansas’s, Arkansas’s and Mississippi’s, or Texas’s and Mississippi’s? One approach would be to compute the average similarity over all pairs of documents drawn from each pair of corpora.
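A rough sketch of that corpus-level comparison, again with scikit-learn: fit TF-IDF on the combined set of documents so the scores are comparable, then average the cosine similarity over every cross-corpus pair. The corpus variable names at the bottom are hypothetical placeholders for lists of individual ad texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mean_cross_similarity(corpus_a, corpus_b):
    """Average cosine similarity over all (a, b) document pairs."""
    tfidf = TfidfVectorizer().fit_transform(corpus_a + corpus_b)
    a = tfidf[:len(corpus_a)]
    b = tfidf[len(corpus_a):]
    return cosine_similarity(a, b).mean()

# texas_ads, arkansas_ads, mississippi_ads would be lists of ad strings.
# print(mean_cross_similarity(texas_ads, arkansas_ads))
# print(mean_cross_similarity(arkansas_ads, mississippi_ads))
# print(mean_cross_similarity(texas_ads, mississippi_ads))
```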

Just as a side-note, if we solve the problem of automatically transcribing individual Texas runaway ads, we could use cosine similarity and TF-IDF to locate duplicate ads. Runaway slave ads were often posted multiple times in a newspaper, sometimes with minor differences between each printing of the advertisement (for example, in reward amount). We could classify pairs of documents with a cosine similarity score greater than a specified threshold as duplicates.
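A quick sketch of that deduplication idea, with an arbitrary example threshold of 0.9 (the right cutoff would have to be tuned against known reprints):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_duplicates(ads, threshold=0.9):
    """Return index pairs of ads whose cosine similarity exceeds the threshold."""
    sim = cosine_similarity(TfidfVectorizer().fit_transform(ads))
    return [
        (i, j)
        for i in range(len(ads))
        for j in range(i + 1, len(ads))
        if sim[i, j] > threshold
    ]
```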

We could also use Named Entity Recognition to measure the similarity of corpora in terms of place-connectedness. Named Entity Recognition is a tool to discover and label words as places, names, companies, etc. Names might not be too helpful since, as far as I have been able to tell, slaves were usually identified just by a first name, but it would be interesting to see which corpora reference locations corresponding to another state. For example, there might be a runaway slave ad listed in the Telegraph and Texas Register in which a slave was thought to be heading northeast towards Little Rock, where he/she has family. The Arkansas corpus would undoubtedly have many ads with the term Little Rock. If there were a significant number of ads in Texas mentioning Arkansas places, or vice-versa, this is information we would want to capture to measure how connected the Texas and Arkansas corpora are.

Demo run of Stanford's Named Entity Tagger on an Arkansas runaway slave ad

A simple way to quantify this measure of place-connectedness would start with a Named Entity Recognition list of tokens and what type of named entity each one is (if any). Then we would iterate through all the tokens and, whenever a token represents a location in one of the other states (perhaps the Google Maps API could be used to resolve place names?), increment the place-connectedness score for that pair of states. A rough sketch of this idea is below.
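The sketch assumes we already have NER output as (token, label) pairs (Stanford's tagger marks places as LOCATION) and a small hand-built gazetteer of place names per state. The gazetteer entries below are just illustrative guesses; in practice the state lookup might come from a geocoding service such as the Google Maps API.

```python
from collections import defaultdict

# Hypothetical gazetteer: place names assumed to belong to each state.
PLACES_BY_STATE = {
    "Arkansas": {"Little Rock", "Pulaski", "Fayetteville"},
    "Mississippi": {"Natchez", "Vicksburg", "Jackson"},
    "Texas": {"Houston", "Galveston", "San Antonio"},
}

def place_connectedness(ner_tokens, home_state):
    """Count mentions of places in each *other* state.

    ner_tokens: list of (token, label) pairs from an NER tagger,
    e.g. [("Little", "LOCATION"), ("Rock", "LOCATION"), ("ran", "O"), ...].
    """
    # Join consecutive LOCATION tokens into full place names.
    places, current = [], []
    for token, label in ner_tokens:
        if label == "LOCATION":
            current.append(token)
        elif current:
            places.append(" ".join(current))
            current = []
    if current:
        places.append(" ".join(current))

    scores = defaultdict(int)
    for place in places:
        for state, known_places in PLACES_BY_STATE.items():
            if state != home_state and place in known_places:
                scores[state] += 1
    return dict(scores)
```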

We also explored other tools that can be used to compare text documents. In class we have already looked at Voyant Tools, and now we have been looking at other publicly available tools that can be used to compare documents side by side. TAPoR is a useful resource that lets you browse and discover a huge collection of text analysis tools from around the web. It contains tools for comparing documents as well as for other kinds of text analysis. As we move forward with our project, TAPoR could definitely be a great resource for finding and experimenting with different tools that can be applied to our collection of runaway slave ads.

TAPoR provides a tool from TAPoRware called Comparator that analyzes two documents side by side to compare word counts and word ratios. We tested this tool on the Arkansas and Mississippi runaway advertisement collections. This sample comparison already yields interesting results, and gives an idea of how we could use word ratios to raise questions about runaway slave patterns across states.
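To make the idea of word ratios concrete, here is a small sketch of the kind of comparison the Comparator performs: the relative frequency of each word in one text divided by its relative frequency in the other, with simple add-one smoothing for words missing from one of the texts. This is not TAPoR's actual code, just an approximation of the idea.

```python
from collections import Counter
import re

def word_counts(text):
    """Lowercased word counts for a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def relative_frequency_ratios(text1, text2):
    """Ratio of each word's relative frequency in text1 to that in text2."""
    c1, c2 = word_counts(text1), word_counts(text2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    return {
        word: ((c1.get(word, 0) + 1) / n1) / ((c2.get(word, 0) + 1) / n2)
        for word in set(c1) | set(c2)
    }

# arkansas_text and mississippi_text are placeholders for the two corpora.
# ratios = relative_frequency_ratios(arkansas_text, mississippi_text)
# Words with the highest ratios are distinctive of the Arkansas ads:
# print(sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:20])
```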

These screenshots show a test run of the ads through the TAPoR Comparator; the Arkansas ads are Text 1 and the Mississippi ads are Text 2. The comparison reveals that the words “Cherokee” and “Indians” have a high relative frequency in the Arkansas corpus, perhaps suggesting a higher rate of interaction between runaway slaves and Native Americans in Arkansas than in Mississippi. Clicking on a word of interest gives a snippet of the word in context. Looking into the full text of ads containing the word “Cherokee”, we find descriptions of slaves running away to live in the Cherokee nation, running away in the company of Native Americans, slaves who were part Cherokee and could speak the language, and even one slave formerly owned by a Cherokee.

However, after digging into the word ratios a little deeper, it turns out that the words “Choctaw” and “Indian” are used about evenly in Arkansas and Mississippi, so in the end the two states may have similar patterns of runaway interaction with Native Americans. Nevertheless, this test of the Comparator gives us an idea of the sorts of questions it could help raise and answer when comparing advertisements. For example, many of us were curious whether Texas runaway slaves ran away to Mexico or ran away with Mexicans. We could use this tool to look at the ratios of the words “Mexico” or “Mexican” in Texas in comparison to other states.

Some Text Mining Resources

Today in class I briefly mentioned TF-IDF (Term Frequency-Inverse Document Frequency) as a possible way for us to identify "give away" words that might appear more frequently in a particular document. Here are some introductory explanations of the method:

And here’s a cool visualization experiment using TF-IDF made by Tim Sherratt, who also made the Real Face of White Australia and Headline Roulette sites shown in class today.

I also mentioned Named Entity Recognition in class; this is the same library used by the Rezo Viz tool that Daniel and Alyssa showed us in their Voyant Tools presentation. It may be possible for us simply to use Voyant as an interface for NER and export a list of place and person names from our ads, but we need to look into this further.

Slides from Tool Presentations

Thanks for the great job that you all did on your presentations about digital tools that might be helpful for our project with runaway slave ads! I’m posting here the slides that were shown in class so that we can reference them. Click the image to get the whole PDF.

First, Alyssa and Daniel talked with us about Voyant Tools:

Clare and Kaitlyn talked about using Google Maps and Google Fusion Tables, together with Social Explorer:

Thanks for sharing!

Using Voyant Tools for Runaway Ads

I’ve been using the site Voyant Tools to look at the text content of runaway ads.
In a nutshell, the site pulls all the words from a text and finds their frequencies and trends. It displays them in a variety of ways, which I’ll show with its analysis of 550 pages of Mississippi slave ads.

Rather than rely on screenshots, you can view the results through this link (one nice feature is that each data set gets its own URL and unique ID, which allows re-linking and comparing between documents).

Features include Cirrus, a basic word cloud; numerical data on the appearances of words in the corpus; the option to see each appearance in context; and Trends, a tool that visually maps out the relative frequency of a word over the course of the document.

This last tool is the most interesting to me: in chronologically ordered ad sets, it gives you an immediate look at the relative usage of a term over time. For example, the term “1836” has one remarkable spike in usage over the course of several decades. We can use this to track usage of racial descriptive terms over time, or similar word-based information.

Through the incorporation of numerous corpora, we can also compare word usage across different states and areas. I can see how this will be helpful in the future in answering some of our questions about how Texas runaways and their situations differed from those in the rest of the South.

HW#5: Thoughts and Progress on Voyant

For the group presentations, I’ve been working with the tool Voyant, which does text analysis on one or more documents. Among its tools, it generates a word cloud of the most frequent words, graphs word frequency across the corpus, and lets you compare multiple documents. Once you have a text uploaded, you can play around a lot within the Voyant “skin”, opening and closing different tools, or clicking on a particular word to see trends for that word specifically. It’s also possible to generate a link to the skin that can then be shared with others, allowing them to play around with the data on their own. I think this interactive feature could be really useful, since it lets anyone who is curious take a look at the data and track key words in pursuit of whatever questions they might be interested in.

Just as an example of what using the Voyant tools looks like, this screenshot shows Shakespeare’s works (Voyant’s sample corpus).

Right now I have the word “king” selected, allowing me to see specific information about the word such as where in the corpus the word appears, frequencies of the word over time, and the word in context.

To apply Voyant specifically to runaway slave ads, Daniel and I looked at transcribed documents of runaway slave ads from Mississippi and Arkansas (PDFs available from the Documenting Runaway Slaves project). I looked at the Arkansas ads, splitting the corpus up in two different ways: first by decade, and then as a single document of all the ads from 1820-1865. (Note: to turn off common stop words such as “and” and “the”, click the gear icon and choose the English stop word list.) Splitting the ads up by decade could potentially make it easier to track changes over time, although since the original document was already ordered chronologically, this is also possible to do with the single document. Another possibility we talked about in class is splitting the runaway ads into individual documents, making it possible to compare specific ads rather than time clumps.

During class, Daniel and I combined the Arkansas and Mississippi documents to do a side-by-side comparison of the two states. Not surprisingly, “Arkansas” is a distinctive word in the Arkansas documents, but with other words such as “sheriff” or “committed” it could be interesting to dig down deeper and figure out why those differences exist. Are these merely linguistic/word choice differences, or do they indicate a difference in runaway patterns? These are the sorts of questions which Voyant raises, but can also help answer, with tools such as keywords in context.

I was interested in comparing the work we’d already done on Mississippi and Arkansas to some of the Texas ads we’ve collected in the Telegraph and Texas Register. I transcribed Texas ads from 1837 (excluding reprints) and compared that with Mississippi and Arkansas ads from 1837. The sample from Texas is small, so I would be hesitant to draw grand conclusions from this comparison, but it’s a good place to start addressing the questions many of us were interested in about what difference Texas makes (if any) in runaway patterns. Here are the results of all three states for 1837. Looking forward, I’m interested in looking at these results more closely to see if they raise interesting questions regarding Texas. This can help us answer questions about whether or not it’s worthwhile to continue transcribing Texas ads (and if so, how many), and how to split up the data (by year, by individual advertisement?).

The main downside to using Voyant so far is the same issue we ran into with MALLET: the Telegraph and Texas Register advertisements are not available individually in text format. This is not so much a limitation of Voyant itself as of the medium of the primary source documents we are working with. It does seem at this point that Voyant could be a useful tool, but if we as a class decide to use it for our project in the future, we’ll have to think of ways to get around that obstacle.