Paper Machines Debriefing

I hope you enjoyed playing around with Paper Machines in our workshop with Jo Guldi. As promised, here’s a brief summary of how I constructed the corpus we used for our visualizations. I’ll follow that with some of the visualizations you made, and invite you to comment on what you see that’s of interest.

Building the Corpus

The best way to see the power of Paper Machines is to get a bunch of full-text sources in your Zotero library. There are many ways to do this; in the interest of time for this workshop, I chose one of the quickest.

If you had more time, you could download articles from JSTOR’s Data for Research, or you could use a program like Adobe Acrobat Pro to OCR some PDF files (which is necessary to turn page images into searchable text). I, on the other hand, wanted to find some documents that had already been OCR-ed and just get some plain text into a folder so we could play with it.

The first thing I wanted to do was identify a corpus of texts that had something in common already; as you know, I eventually settled on personal narratives about Civil War prisons and prisoners of war. To find these narratives, I first went to Hathi Trust and did a "subject" field search of "full view only" sources. That netted me these results. Using the Zotero button in my URL bar, I added several pages of these results to our RiceDH group. (Read more about adding items to a Zotero library.)

At this stage all I had was the metadata for the narratives—the publication date, title, author, and so on. I didn’t yet have the full texts themselves. To get those, I moved over to the Internet Archive, which contains OCR-ed text of many public domain books that have been digitized by Google or other organizations. All but a handful of the books I had entered from Hathi Trust had full-text files on the Internet Archive.

This was the only tedious part of the process. I would go to an item page like this one. Clicking on the "All files: HTTPS" link in the sidebar would get me here. Now I would right click and save the *.txt file to my Desktop. Then I would return to Zotero, and while selecting the relevant item I would use the attachment pull-down menu (it looks like a paper clip) to "Attach stored copy of file…" In the pop-up window, I would navigate to the text file on my Desktop, and choose to attach it to the bibliographic record in Zotero.[1]
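If you would rather script the download step than right-click each file, here is a minimal Python sketch. It is not what I did for the workshop, and it rests on an assumption: that each Internet Archive item stores its OCR-ed text under the conventional name <identifier>_djvu.txt, which is common for scanned books but not guaranteed. The function names are made up for this example, and attaching the saved file in Zotero is still a manual step.

```python
from pathlib import Path
from urllib.request import urlopen


def full_text_url(identifier: str) -> str:
    """Build the usual URL for an item's OCR-ed plain text.

    Assumption: the item has a file named <identifier>_djvu.txt.
    Check the item's "All files" listing if the download fails.
    """
    return f"https://archive.org/download/{identifier}/{identifier}_djvu.txt"


def save_full_text(identifier: str, folder: Path) -> Path:
    """Download one item's text file into `folder` and return its path."""
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{identifier}.txt"
    with urlopen(full_text_url(identifier)) as response:
        target.write_bytes(response.read())
    return target
```

To use it, you would call save_full_text with the identifier that appears at the end of an item page's address and a folder such as Path.home() / "Desktop" / "corpus".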

Using Paper Machines

Once we had the texts in our Zotero group, Jo invited us to divide them into subcollections that seemed appropriate to us, like "Andersonville" and "Not Andersonville," or "Escaped" and "Did Not Escape." Sorting the texts by date, I divided them into two subcollections called "Pre-1880" and "Post-1880."

Using just a few of the subcollections we had at the time, Jo made a topic model that looked like this, which found some interesting clusters that seem like they deal with escape attempts (“tunnels,” “hounds,” “capture” and so on):

Topic Model of Civil War Prison Narratives

I made some "phrase nets" for the pre-1880 and post-1880 subcollections. As I noted in class, what struck me about these images was the appearance of "blue and gray" and the disappearance of "white and black" in the post-1880 texts, which might be explained by David Blight’s hypothesis that Civil War memory by 1900 was concerned more with reconciliation and romantic stories about the blue and the gray than about the racial issues addressed by and during the war. The appearance of "met and conversed" after 1880 might also be significant in this regard, depending on the contexts in which the phrase was used:

"X and Y" Phrase Net for Pre-1880 Civil War Prison Narratives

"X and Y" Phrase Net for Post-1880 Civil War Prison Narratives

Using her own Zotero library of runaway slave ads from another project, Whitney made this word cloud, and informed us that it was interesting to see some words generally used mainly in Caribbean slave societies appearing in her Louisiana ads:

Whitney's Runaway Ads Word Cloud
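A word cloud is at bottom a picture of word counts. For anyone who wants the numbers behind the picture, here is a minimal Python sketch that counts the words in a text and divides each count by the total number of words, so that a long document does not outweigh a short one simply because it is longer. This is an illustration under simple assumptions (words are runs of letters and apostrophes, and nothing is filtered out), not a description of how Paper Machines builds its clouds. The function name is made up for the example.

```python
import re
from collections import Counter


def relative_frequencies(text: str) -> dict[str, float]:
    """Return each word's share of all words in `text`."""
    # Assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}
```

Running this on each document separately, rather than on a whole folder at once, is also one way to see whether a prominent word comes from many texts or from just one.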

Another one of you uploaded these other phrase nets for the Andersonville and non-Andersonville collections:

"X and Y" Phrase Net for Civil War Prison Narratives from the "Not Andersonville" Folder

"X and Y" Phrase Net for Civil War Prison Narratives in the "Andersonville" Folder

UPDATE: Here is another visualization of a phrase net submitted by Christina. It shows “X of the Y” nets in the “escape” folder.

Phrase Net of "X of the Y" from the Escape Folder
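The phrase nets in this post all rest on the same simple idea: find every pair of words joined by a fixed connector, such as "X and Y" or "X of the Y," and count how often each pair occurs. The Python sketch below illustrates that idea only. It is not Paper Machines' actual code, and the function name phrase_pairs is made up for the example.

```python
import re
from collections import Counter


def phrase_pairs(text: str, connector: str = "and") -> Counter:
    """Count (X, Y) word pairs that appear as "X <connector> Y"."""
    # The lookahead lets Y also serve as the X of the next pair,
    # so "blue and gray and white" yields two pairs, not one.
    pattern = re.compile(
        rf"\b([a-z]+) {re.escape(connector)} (?=([a-z]+)\b)"
    )
    return Counter(pattern.findall(text.lower()))
```

Calling phrase_pairs(text).most_common(10) would list the ten most frequent "X and Y" pairs, and phrase_pairs(text, "of the") does the same for "X of the Y."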

If you still have a visualization that you’d like to upload to the group, let me know. I invite you to comment on the experience of playing with Paper Machines and looking at these images. What do you find useful about the text mining methods it uses? Do you notice other interesting patterns in the visualizations included here?

  1. You may be wondering why I went through the Hathi Trust step at all. I could have searched for Civil War prisoners and prisons within Internet Archive like this. But as you can see that search returned fewer results than at Hathi Trust, and some of them were not relevant. Hathi is generally more consistent about adding full metadata like subject fields to items in its database, but as a downside you can’t get already OCR-ed text there and sometimes can’t even get the full PDF. The takeaway lesson, though, is not to limit yourself to one database or search engine when doing your research.

12 Responses to Paper Machines Debriefing

  1. This was definitely a project for which I felt like any coding ability on my part would’ve transformed what I could get out of it. The topic modeling option, in particular, is something I think will require some intense inspection before I really understand how the parameters work and what it means, from the program’s perspective, to change the parameters. Topic modeling is what I thought was by far the coolest opportunity in the program, though; concentrating all of that data into one image is a very very convenient tool, and would set out research ideas which would be more specific than what you could get from word clouds.

  2. Glad the topic modeling intrigued you, Charlie. I agree that to use it most effectively requires understanding what it’s actually doing, and I’ve been collecting some links on text mining and topic modeling that may be of interest to you. David Mimno’s talk at MITH on the technical side of topic modeling has also been highly recommended to me as a clear explanation of what’s involved, though I haven’t watched it yet.

  3. I wanted to go back to put up an interesting word cloud from the folder “family” that I put together but when I went back, the folder had been deleted. I do remember finding it really interesting that the only name that appeared in the word cloud was William, and that prison appeared in every word cloud (I did one that divided over time) except for one. I’m not sure what if anything can be gleaned from that sort of information, though. The name William showing up a lot–does that mean that it was a really popular name for the generation of soldiers who fought the Civil War and lots of people talked about friends/family named William, or did one text mention the name enough to deserve a place in the word cloud? And that’s what I think is interesting about these visualizations–they seem to be good for the beginning stages of projects when you’re trying to figure out what directions might be good and which might be dead ends, but I’m not sure what their analytic value really is–I guess you’ll get out of it what you put into it. And on that note, I have a question: how expensive/time consuming is it to run lots of pdfs through OCR technology? I would love to be able to search for words in my newspaper pdfs (and the Paper Machines visualizations would be good too), but I’d like to be efficient with both money and time.

  4. After watching the video of Jo Guldi’s talk at MITH and seeing what she has done with Paper Machines, I am once again amazed by the possibilities of digital humanities and am beginning to feel like I should embrace the nuances as soon as I can. For instance, Quincy and I only figured out what “OCR”-ing a document meant at the Chad Black lecture, but knowing that for this week made my learning curve easier. However, not having the Zotero plug-in already makes me realize how I should be setting myself up now to explore whatever means of digital history I might be interested in later. Committing to an understanding of GIS mapping and possibly learning python (although that seems incredibly intimidating) would be ways that I can add value to my basic skill set.

  5. Jo created this amazing tool for a reason that even the most technophobic of historians can understand–sometimes there is simply too much to read. I was struck, however, that after she used Paper Machines to figure out where to direct her research energies, her research would proceed in a very standard manner–open a document, and read it. While I am very pleased to see that the training I’ve received in close reading has relevance in the digital age, I wonder if there are ways to use Paper Machines beyond just generating research questions. How might Paper Machines help give us good historical answers?

    I also want to pose the same question that I suspect I’ll have for every meeting in this course: How might this tool be used in the undergraduate classroom? I’ll ask Jo specifically if this is something she has considered and invite others to chime in with their thoughts as well.

  6. Wright Kennedy

    Kelly, off the top of my head I thought of three significant benefits of visualization tools. Visualization tools, like Paper Machines, allow you to examine a much bigger set of information, give you new perspectives on the data/sources, and make this information presentable (especially to a lay audience).
    To answer your question, yes, it is very easy to run an OCR on any number of PDFs or images. In Adobe Acrobat Pro (which should be on most lab computers; if you are having trouble finding a computer with the software, come to the GIS/Data Center), go to View>Tools>Recognize Text. From here select In Multiple Files. Select the files or folder with the PDFs or images you want to run the OCR on, and choose the output options. It’s as easy as that! Also, two weeks ago Chad Black showed us a python script which automated a similar process. Using scripts allows for complete customization of the processes.

    On to Paper Machines: I am really excited about the possibilities of the Paper Machines tool-set. Word clouds, in my opinion, are useful for presenting data. When comparing word frequencies, however, I find it more useful to examine the values in a more standardized format (e.g., a bar graph). Similarly, it is often useful and prudent to normalize the values (divide the word count by the total number of words in the document so that longer documents on a specific topic do not skew the results).

    The mapping functions of Paper Machines are especially interesting. The Flight Paths and Heat Map tools could be useful for a geographic overview of the place names mentioned in a text, but the Export Geodata to CSV function is the key tool for me. Jo mentioned to me that this function had just been added. The tool exports a file that can be opened as a spreadsheet (in MS Excel, for example). The file contains the place name, latitude & longitude, and a snippet of the text around the place name (15 characters before and after the geographic term). With this spreadsheet, these records can be imported into GIS software, such as ArcGIS. The large library of spatial analysis tools in ArcGIS can then be used to analyze the data first extracted by Paper Machines.
    As a test, I ran the Export Geodata to CSV function on my Master’s Thesis (about the 1878 yellow fever epidemic in Memphis, TN), and the tool extracted 57 mentions of Memphis. Using the advanced search feature in Adobe Acrobat, however, I found 166 mentions of Memphis. So the engine still needs some tweaking, but it is a very promising tool.

  7. I looked at Paper Machines, and the possibilities of what you can do are really intriguing! I think it would be very interesting if people got interested in doing more independently after playing around with the tool, but I definitely realize that a lot of utility will be, as stated above, using the visualization as a foundation and proceeding from there.

    I really just enjoy data visualization immensely as a part of computer science because of how useful it is for every field. I definitely think that being able to glean information from a quick glance (even though, like in a word cloud, you might not get the specificity desired if you don’t tweak it a bit) is super useful and I can definitely see it appealing to undergraduates if some cool trends popped up that one might not have independently come up with!

  8. Hey everyone, sorry for not getting this up by last night. I’m not sure if the trackback is going to show up here, so I thought I’d link directly to a blog post of my own where I wrote some thoughts about how I might use Paper Machines.

    One of the main benefits I think would be for generating research questions for a comparative project like my dissertation, even if it was just analyzing secondary material like journal articles.

    Here’s the post:

  9. Christina Villarreal

    First of all, I cannot believe that I’ve managed to live without Zotero for so long. As a person who values efficiency, I am fascinated by the amount of time digital tools can save. I think programs like Paper Machines, beyond inspiring new research projects, can be a great aid for the classroom. In response to Ben’s question, I think the first steps towards using this type of digital history will be the usage of the visual representations. Perhaps this tool should be included in a course on research methods. I’m excited to see what comes next in digital history!

  10. Wright, I’m glad to hear that you’ve actually run a test on how reliable the Paper Machines process was at identifying locations to put into the Export Geodata to CSV function. Actually going in and manually checking the figures for a corpus is probably not going to be an option every time, but it’s good to see that one of us got an estimate of how well Paper Machines was working.

    I figure Christina is right, about visualizations being one of the main fruits of digital humanities that you’d use in teaching an undergrad course. I don’t know if anyone else is already familiar with this site or not—maybe Elizabeth—but one of my favorite places to get inspiration for data visualizations, going back to my time as an archi, is Many of the examples they’re showing here are trying to show a network in some way, so this is almost bringing us off topic from Paper Machines, and it’s kind of luck of the draw whether the creator of the visualization does a thorough job of explaining what they did. Sometimes they do have pretty good explanations, though, and definitely there’s a lot to get inspired by here.

  11. And Whitney—good call on tweeting about those illuminated manuscripts, that is a solid solid find.

  12. I should actually also note, for the record, that the Andersonville/Not Andersonville category/visualization is mine, but of course it doesn’t communicate terribly much. Only four documents mentioning Andersonville in the corpus means I don’t think we can draw any particularly useful conclusions from such a small sample size. The non-association of “wounded” with “captured” in the Andersonville documents, as compared with the association between those words in the other documents, is something I want to think is suggestive of high mortality rates among the wounded in the Civil War—not much discussion of prisoners having been wounded before being brought to Andersonville, I’m suspecting, because they may have been more likely to die than to survive long enough to be put in the prison camp. Something I guess I might follow up on by looking at a larger set of Andersonville-related narratives, though, if this was my field.
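
A note on the check Wright describes in comment 6: comparing the 57 mentions of Memphis that the geodata export found with the 166 he found by searching the PDF means the export caught roughly a third of them. The same kind of spot check can be scripted against a plain-text copy of a document. The Python sketch below is only an illustration; the function names are made up, and it counts whole-word matches regardless of capitalization, which may not match how Acrobat's search or the Paper Machines geoparser counts.

```python
import re


def count_mentions(text: str, place: str) -> int:
    """Count whole-word, case-insensitive mentions of `place` in `text`."""
    pattern = re.compile(rf"\b{re.escape(place)}\b", re.IGNORECASE)
    return len(pattern.findall(text))


def share_found(extracted: int, manual: int) -> float:
    """Fraction of manually counted mentions that the tool also extracted."""
    return extracted / manual


# The figures reported in comment 6: 57 extracted, 166 found by search.
print(f"{share_found(57, 166):.0%}")
```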