I hope you enjoyed playing around with Paper Machines in our workshop with Jo Guldi. As promised, here’s a brief summary of how I constructed the corpus we used for our visualizations. I’ll follow that with some of the visualizations you made, and invite you to comment on what you see that’s of interest.
Building the Corpus
The best way to see the power of Paper Machines is to get a bunch of full-text sources in your Zotero library. There are many ways to do this; in the interest of time for this workshop, I chose one of the quickest.
If you had more time, you could download articles from JSTOR’s Data for Research, or you could use a program like Adobe Acrobat Pro to OCR some PDF files (which is necessary to turn page images into searchable text). I, on the other hand, wanted to find some documents that had already been OCR-ed and just get some plain text into a folder so we could play with it.
The first thing I wanted to do was identify a corpus of texts that had something in common already; as you know, I eventually settled on personal narratives about Civil War prisons and prisoners of war. To find these narratives, I first went to Hathi Trust and did a "subject" field search of "full view only" sources. That netted me these results. Using the Zotero button in my URL bar, I added several pages of these results to our RiceDH group. (Read more about adding items to a Zotero library.)
At this stage all I had was the metadata for the narratives—the publication date, title, author, and so on. I didn’t yet have the full texts themselves. To get those, I moved over to the Internet Archive, which contains OCR-ed text of many public domain books that have been digitized by Google or other organizations. All but a handful of the books I had entered from Hathi Trust had full-text files on the Internet Archive.
This was the only tedious part of the process. I would go to an item page like this one. Clicking on the "All files: HTTPS" link in the sidebar would get me here. Now I would right click and save the *.txt file to my Desktop. Then I would return to Zotero and, with the relevant item selected, use the attachment pull-down menu (it looks like a paper clip) to choose "Attach stored copy of file…" In the pop-up window, I would navigate to the text file on my Desktop and attach it to the bibliographic record in Zotero.1
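If you ever need to do this for more than a handful of books, the right-click-and-save step could be scripted. What follows is only a rough sketch in Python, not what I actually did for the workshop. It assumes that the Internet Archive exposes a JSON file listing for each item at archive.org/metadata/ followed by the item's identifier, and serves individual files from archive.org/download/, so check both against the Archive's own documentation before relying on them. The function names are my own inventions, and attaching the saved file to the Zotero record is still a manual step.

```python
import json
import urllib.request
from pathlib import Path
from urllib.parse import quote

# Assumed URL patterns for the Internet Archive; verify before relying on them.
METADATA_URL = "https://archive.org/metadata/{identifier}"
DOWNLOAD_URL = "https://archive.org/download/{identifier}/{filename}"


def pick_text_file(metadata):
    """Return the name of the first .txt file in an item's file listing, or None."""
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.lower().endswith(".txt"):
            return name
    return None


def text_url(identifier, filename):
    """Build the download URL for one file belonging to an item."""
    return DOWNLOAD_URL.format(identifier=quote(identifier), filename=quote(filename))


def download_text(identifier, dest_dir="."):
    """Look up an item's files, then save its plain-text file into dest_dir."""
    url = METADATA_URL.format(identifier=quote(identifier))
    with urllib.request.urlopen(url) as response:
        metadata = json.load(response)
    filename = pick_text_file(metadata)
    if filename is None:
        return None  # this item has no plain-text file
    dest = Path(dest_dir) / Path(filename).name
    urllib.request.urlretrieve(text_url(identifier, filename), dest)
    return dest
```

Calling download_text with an item's identifier (the last part of the item page's address) and a folder name would save that item's text file there, or return None for the handful of books that have no full-text file.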
Using Paper Machines
Once we had the texts in our Zotero group, Jo invited us to divide them into subcollections that seemed appropriate to us, such as "Andersonville" and "Not Andersonville," or "Escaped" and "Did Not Escape." Sorting the texts by date, I divided them into two subcollections called "Pre-1880" and "Post-1880."
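I did my sorting by hand in Zotero, but the same date split is easy to express in code if you have your items' metadata in some exported form. The sketch below is hypothetical: it assumes each item is a simple dictionary with "title" and "date" entries, and it puts books published in 1880 itself in the "Post-1880" group, which is a choice I am making only for illustration.

```python
import re


def publication_year(date_string):
    """Pull the first four-digit number out of a messy date like 'c1890' or '[1866?]'."""
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", date_string or "")
    return int(match.group(1)) if match else None


def split_by_year(items, cutoff=1880):
    """Sort items into before-cutoff, cutoff-or-later, and undated groups."""
    groups = {f"Pre-{cutoff}": [], f"Post-{cutoff}": [], "Undated": []}
    for item in items:
        year = publication_year(item.get("date", ""))
        if year is None:
            groups["Undated"].append(item)
        elif year < cutoff:
            groups[f"Pre-{cutoff}"].append(item)
        else:
            # Assumption: the cutoff year itself counts as "Post".
            groups[f"Post-{cutoff}"].append(item)
    return groups
```

Keeping an "Undated" group is deliberate: library records often have missing or odd dates, and it is better to look at those items yourself than to let a script guess.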
Using just a few of the subcollections we had at the time, Jo made a topic model that looked like this. It turned up some interesting clusters that seem to deal with escape attempts (“tunnels,” “hounds,” “capture” and so on):
I made some "phrase nets" for the pre-1880 and post-1880 subcollections. As I noted in class, what struck me about these images was the appearance of "blue and gray" and the disappearance of "white and black" in the post-1880 texts, which might be explained by David Blight’s hypothesis that Civil War memory by 1900 was concerned more with reconciliation and romantic stories about the blue and the gray than with the racial issues addressed by and during the war. The appearance of "met and conversed" after 1880 might also be significant in this regard, depending on the contexts in which the phrase was used:
Using her own Zotero library of runaway slave ads from another project, Whitney made this word cloud, and pointed out that it was interesting to see some words used mainly in Caribbean slave societies appearing in her Louisiana ads:
Another one of you uploaded these other phrase nets for the Andersonville and non-Andersonville collections:
UPDATE: Here is another visualization of a phrase net submitted by Christina. It shows “X of the Y” nets in the “escape” folder.
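If you are curious about what a phrase net is counting, the basic idea can be approximated in a few lines of Python. This is a rough sketch of the general technique and not Paper Machines' own code, which may tokenize and filter words quite differently; the function name is hypothetical. It finds every pair of words joined by a connector such as "and" or "of the" and tallies how often each pair occurs.

```python
import re
from collections import Counter


def phrase_pairs(text, connector="and"):
    """Count (X, Y) word pairs that appear in the text as 'X <connector> Y'."""
    # Lowercase everything and keep only runs of letters and apostrophes.
    # Punctuation is dropped, so a pair can span a sentence boundary.
    words = re.findall(r"[a-z']+", text.lower())
    connector_words = connector.lower().split()
    n = len(connector_words)
    pairs = Counter()
    # Start at 1 and stop early so a word exists on each side of the connector.
    for i in range(1, len(words) - n):
        if words[i:i + n] == connector_words:
            pairs[(words[i - 1], words[i + n])] += 1
    return pairs
```

Running phrase_pairs on the text of one subcollection with connector="and", and again on another, then comparing the most common pairs from each, gives a crude numerical version of the comparison the images let you make by eye.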
If you still have a visualization that you’d like to upload to the group, let me know. I invite you to comment on the experience of playing with Paper Machines and looking at these images. What do you find useful about the text mining methods it uses? Do you notice other interesting patterns in the visualizations included here?
1. You may be wondering why I went through the Hathi Trust step at all. I could have searched for Civil War prisoners and prisons within Internet Archive like this. But as you can see, that search returned fewer results than the one at Hathi Trust, and some of them were not relevant. Hathi is generally more consistent about adding full metadata like subject fields to items in its database, but the downside is that you can’t get already OCR-ed text there and sometimes can’t even get the full PDF. The takeaway lesson, though, is not to limit yourself to one database or search engine when doing your research.