Update on TAPoR

After completing last week’s progress report, one of the questions we were left with was how the TAPoR Comparator calculates relative ratio. The documentation page does not specify where the relative count or the relative ratio come from, but a few trial calculations were able to lead us down the right path. We tested out numbers for “negro,” the most frequently occurring word in the Arkansas document from the Documenting Runaway Slaves Project.

The results? The relative count equals the word count divided by the total number of words in the document, so in this case, 920/80,690 for Arkansas and 2,688/235,602 for Mississippi. Next, the relative ratio equals the Text 1 relative count divided by the Text 2 relative count, displayed here as 0.0114/0.0114. Words that are relatively more frequent in Text 1 (AR) have a relative ratio higher than 1, words that are relatively more frequent in Text 2 (MS) have a relative ratio lower than 1, and words that are relatively equal have a value of 1. The relative ratio adjusts for document length and raw word counts to compare relative word frequencies. For example, even though “negro” has more than double the word count for Mississippi, the relative count for both AR and MS is ~0.0114. This places the relative ratio at 0.9994, almost 1. (This value is not exactly 1 because the displayed relative counts are rounded to four decimal places. The relative counts for AR and MS are not actually the same numbers down to the last decimal place.)
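To make the arithmetic concrete, here is a minimal Python sketch of the calculation as we reconstructed it from our trial numbers. The function names are our own and hypothetical; this is not TAPoR's actual code, only the formula that reproduced the values Comparator displayed for us.

```python
def relative_count(word_count, total_words):
    """Share of a document's words accounted for by one word."""
    return word_count / total_words


def relative_ratio(count_1, total_1, count_2, total_2):
    """Text 1 relative count divided by Text 2 relative count."""
    return relative_count(count_1, total_1) / relative_count(count_2, total_2)


# Counts for "negro" quoted above: Text 1 is Arkansas, Text 2 is Mississippi.
ar = relative_count(920, 80_690)
ms = relative_count(2_688, 235_602)
ratio = relative_ratio(920, 80_690, 2_688, 235_602)

# Both relative counts display as 0.0114 at four decimal places,
# but the unrounded values differ slightly, so the ratio is just under 1.
print(f"AR relative count: {ar:.4f}")
print(f"MS relative count: {ms:.4f}")
print(f"Relative ratio:    {ratio:.4f}")
```

Printing more decimal places for `ar` and `ms` shows the small difference that the four-place display hides.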

So, Comparator balances the differences in document length between AR and MS to reveal that, relative to document length, advertisements from the two states use the word “negro” with practically equal frequency. This sort of comparison could be useful for determining how language used to refer to the race of slaves does (or doesn’t) change across states. Like TF-IDF, Comparator attempts to adjust for term frequency across documents to locate words that occur more commonly in one document than in the rest of the corpus.

Now that we know how they both work, it would be interesting to compare our documents using both TAPoR’s Comparator and TF-IDF to see how the results differ. Here are the results for the word “negro” in Voyant’s TF-IDF option, recently added by Stéfan Sinclair.
Again, AR and MS have very similar TF-IDF scores for the word “negro” despite MS’s raw word count being much higher.

You can view the raw word comparison output from TAPoR comparator at this webpage. You can also view the raw output from Voyant tools at this webpage.

One Response to Update on TAPoR

  1. Great explanation! As I think you’ve noticed before, it seems from the results that Mississippi has a lot more words having to do with captured runaways (like “jail,” “requested,” “committed,” “says,” “prove,” “dealt”), and the fact that the relative ratios for “reward” and “subscriber” are under 1 seems to further confirm that there are more runaway ads in Arkansas relative to jailers’ notices.

    That raises an interesting question of why this is so, a question that may require some additional research. For example, did Mississippi law require jailers’ notices to be posted (similar to Texas) whereas Arkansas did not? (The word “law” is also more common in the Mississippi texts.) The difference could come down to the law, in this case. And if the law was in fact different, that would offer some useful confirmation that your text mining method found a meaningful pattern—that is, it confirmed something we would have expected anyway.

    This is useful insofar as it would then build confidence for making arguments and asking questions about other, perhaps less obvious differences between the texts. For example, why is “boy” more common in the Mississippi ads? Was this a word more likely to be used in jailers’ notices than in runaway ads? That is, was it a word that public officials were more likely to use to describe a slave than slaveowners were?