After completing last week’s progress report, one of the questions we were left with was how the TAPoR Comparator calculates relative ratio. The documentation page does not specify where the relative count or the relative ratio comes from, but a few trial calculations were able to lead us down the right path. We tested out numbers for “negro,” the most frequently occurring word in the Arkansas document from the Documenting Runaway Slaves Project.
The results? The relative count equals the word count divided by the total number of words in the document, so in this case, 920/80,690 for Arkansas and 2,688/235,602 for Mississippi. Next, the relative ratio equals the Text 1 relative count divided by the Text 2 relative count, which displays as 0.0114/0.0114. Words that are relatively more frequent in Text 1 (AR) have a relative ratio higher than 1, words that are relatively more frequent in Text 2 (MS) have a relative ratio lower than 1, and words that are used with equal relative frequency have a value of 1. In other words, the relative ratio adjusts for document length so that word frequencies can be compared across texts of different sizes. For example, even though the raw count for “negro” in Mississippi is nearly three times the Arkansas count, the relative count for both AR and MS is ~0.0114. This places the relative ratio at 0.9994, almost 1. (The value is not exactly 1 because the displayed relative counts are rounded to four decimal places; the unrounded relative counts for AR and MS are not precisely the same number.)
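To make the arithmetic easy to check, here is a minimal Python sketch of the two calculations as we worked them out from our trial numbers. The function names `relative_count` and `relative_ratio` are our own labels for illustration, not anything taken from TAPoR, and this is only our reconstruction of the math, not Comparator's actual code.

```python
def relative_count(word_count, total_words):
    """Share of a document's words accounted for by one word."""
    return word_count / total_words


def relative_ratio(count_1, total_1, count_2, total_2):
    """Text 1 relative count divided by Text 2 relative count."""
    return relative_count(count_1, total_1) / relative_count(count_2, total_2)


# Counts for "negro" reported above: Arkansas is Text 1, Mississippi is Text 2.
ar = relative_count(920, 80_690)
ms = relative_count(2_688, 235_602)
ratio = relative_ratio(920, 80_690, 2_688, 235_602)

print(f"AR relative count: {ar:.4f}")        # 0.0114
print(f"MS relative count: {ms:.4f}")        # 0.0114
print(f"Relative ratio (AR/MS): {ratio:.4f}")  # 0.9994
```

Printing more decimal places (for example `{ar:.8f}`) shows that the two relative counts differ slightly beyond the fourth decimal, which is why the ratio comes out just under 1.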
So, Comparator balances the differences in document length between AR and MS to reveal that, relatively speaking, advertisements from the two states use the word “negro” with practically equal frequency. This sort of comparison could be useful for determining how the language used to refer to the race of slaves does (or doesn’t) change across states. Like TF-IDF, Comparator attempts to adjust for term frequency across documents to locate words that occur more commonly in one document than in the rest of the corpus.
Now that we know how they both work, it would be interesting to compare our documents using both TAPoR’s Comparator and TF-IDF to see how the results differ. Here are the results for the word “negro” in Voyant’s TF-IDF option, recently added by Stefan Sinclair.
Again, AR and MS have very similar TF-IDF scores for the word “negro” despite MS’s raw word count being much higher.