Today in class we stepped back for a moment to think about the various methods we could use to compare runaway ads from different states. Our current plan, still subject to change, is to build a site that would compare different methods for answering the question: "Were Texas runaway slave ads different from slave ads in other Southern states?"
Those methods may include:
- A traditional close reading of different states’ ads.
- Identifying words in each state’s corpus with high TF-IDF or relative frequency scores to see what the "outlier" words in each one may be.
- Using cosine similarity together with TF-IDF to rank the similarity between each state’s ads.
- Using the "clustering" (more precisely, classification) algorithms described by Franco (like those used in a spam filter) to see whether a computer can be trained to tell ads from different states apart.
- Having human users attempt to identify the origin of different ads (with giveaway words like state-specific place names removed with NER) to show how difficult (or comparatively easy) this is.
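To make the TF-IDF and cosine-similarity ideas concrete, here is a minimal sketch in plain Python. This is not our project's pipeline, and the toy "state corpora" below are invented stand-ins for illustration, not real ads; in practice we would work from our transcribed corpora and likely use an existing library.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF weight dicts for a list of tokenized documents."""
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    df = Counter()  # document frequency: in how many docs each term appears
    for c in counts:
        df.update(c.keys())
    idf = {term: math.log(n / df[term]) for term in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    if norm(a) == 0 or norm(b) == 0:
        return 0.0
    return dot / (norm(a) * norm(b))

# Toy stand-ins for state corpora (hypothetical wording, not actual ads)
texas = "ranaway negro man reward bayou texas".split()
arkansas = "ranaway negro man reward river arkansas".split()
virginia = "absconded servant reward tobacco virginia".split()

vecs = tfidf_vectors([texas, arkansas, virginia])
print(cosine(vecs[0], vecs[1]))  # Texas vs. Arkansas
print(cosine(vecs[0], vecs[2]))  # Texas vs. Virginia
```

Note how TF-IDF zeroes out words that appear in every corpus (here, "reward"), so the similarity score is driven by each state's distinctive vocabulary, which is exactly what the "outlier words" method would also surface.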
Among the challenges we discussed today was how to gauge whether these methods are successful, and whether they are likely to tell us anything surprising (or whether the lack of surprise may in fact help to build an argument about the uniformity of runaway slave ads across space). We also began a discussion of how we might "test" document similarity methods by, for example, training them with clearly different kinds of texts (jailors’ notices versus runaway ads?) or clearly similar texts (slightly modified reprinted ads? halves of the same state’s corpus?) to see how well they "work" at identifying similarity and difference.
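The split-halves test mentioned above could be sketched very simply: divide one state's ads randomly in two and check that a similarity measure scores the halves as nearly identical. The corpus below is a made-up placeholder; in practice we would use, say, the transcribed Arkansas Gazette ads.

```python
import math
import random
from collections import Counter

def freq_cosine(tokens_a, tokens_b):
    """Cosine similarity of raw word-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b))

# Hypothetical corpus standing in for one state's transcribed ads
corpus = ("ranaway from the subscriber a negro man reward " * 50).split()
random.seed(0)
random.shuffle(corpus)
half1, half2 = corpus[:len(corpus) // 2], corpus[len(corpus) // 2:]
print(freq_cosine(half1, half2))  # halves of one corpus should score near 1.0
```

If a method fails this easy case, or fails to separate clearly different genres (jailors' notices versus runaway ads), that tells us it is not "working" before we invest in the bigger comparison.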
We also noted, in light of Clare’s reminder that Carrigan emphasized central Texas as the place where we are most likely to see differences, that our current Texas newspaper source is from urban east Texas, which may make it necessary to gather ads from another newspaper, like the Austin Gazette.
The good news is that we determined we do have a data set of Texas ads roughly comparable in size to our set of ads from the Arkansas Gazette. And we also came up with some other things we could do with our texts—like using NER to pull out place names and judge how connected different states’ corpora are, or to count male and female names in the corpus—that might be possible outputs if the above methods don’t tell us anything particularly interesting.
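As a quick stand-in for the NER idea, here is a sketch that pulls place names from an ad using a tiny hand-built gazetteer. The place list and the sample ad are both hypothetical; a real pipeline would use an actual NER tool (such as Stanford NER or spaCy) rather than a fixed list.

```python
import re
from collections import Counter

# Tiny hypothetical gazetteer; a real pipeline would use trained NER
# instead of a hand-built list of state-specific place names.
PLACE_NAMES = {"houston", "galveston", "austin", "little rock", "bastrop"}

def extract_places(text):
    """Return a count of known place names appearing in an ad."""
    lowered = text.lower()
    return Counter(p for p in PLACE_NAMES
                   if re.search(r"\b" + re.escape(p) + r"\b", lowered))

ad = ("Ranaway from the subscriber near Houston, a negro man; "
      "he may be making for Austin.")
print(extract_places(ad))
```

The same extraction step would serve both uses we discussed: tallying place names to judge how connected the corpora are, and masking those "giveaway words" before showing ads to human guessers.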
Our next step seems to be to identify ways to prototype or test available methods—perhaps starting with existing tools like Voyant—with the lowest up-front cost, so that we can determine which of the bigger projects are worth pursuing.
Before coming to class on Friday, please read the posts by your classmates and think about these issues. Another productive activity may be to read through ads we have collected and transcribed on our Google Doc, to see if salient topics or ideas occur to you that way.