Author Archives: krs4

Geography Team Progress Report 4/7

Upon getting to work and trying to follow our schedule, we realized that we had planned for too many deliverables in a very short period of time. We are in the process of adjusting the schedule and replanning what needs to be done. In the meantime, the three of us have been working on a few different tasks during the break.

Clare has been working on a draft of the close reading. Her rough draft includes an introduction, an analysis of Arkansas, a discussion of the advantages of the digital medium, and a conclusion. The analysis of Arkansas will act as a prototype for what she plans to do with the conclusions she is reaching about Mississippi and Texas, although her data collection requires more time than previously realized. She also plans to diversify her advertisement examples, as the current ones come from a few select years. We will discuss suggestions for the progress of the essay with Dr. McDaniel.

Aaron wrote a Python script that tags locations in each advertisement. He ran the cleaned advertisements from the two Texas newspapers and the Mississippi and Arkansas corpora through the script and saved the results as JSON files. I then tried to run the tagged locations through GeoNamesMatch, but I quickly ran into difficulties. After discussing the problem with Aaron, we decided that the input and output formats of that particular program were inconvenient for what we are trying to do. Aaron experimented with Google's free geocoding API (through the Python library Geopy) and had some success with it, so we have decided to use that instead. Aaron and I then started cleaning up the pretty-printed JSON of the tagged locations, and we realized that even though we don't have to correct spelling or expand state abbreviations, this task is going to take a very long time because of the large number of advertisements we have, especially in Mississippi. Our original plan was to compare the output of NER to the actual advertisement (essentially using the NER results as a footing for the actual list of locations), but given the large amount of data and the limited time left in the semester, that may be infeasible.
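Since the tagger's per-advertisement results are being saved as pretty-printed JSON for hand cleaning, the save-and-load round trip can be sketched roughly as follows (the record layout and function names here are illustrative assumptions, not Aaron's actual script):

```python
import json

def save_tagged_locations(tagged_ads, path):
    """Write tagged locations to a pretty-printed JSON file, one record per ad.

    `tagged_ads` is assumed to be a list of (ad_id, [location strings]) pairs.
    """
    records = [{"ad_id": ad_id, "locations": locs} for ad_id, locs in tagged_ads]
    with open(path, "w") as f:
        # indent=2 pretty-prints the JSON, which makes manual cleanup easier
        json.dump(records, f, indent=2)

def load_tagged_locations(path):
    """Read the cleaned-up JSON back in for geocoding."""
    with open(path) as f:
        return json.load(f)
```

From here, each cleaned location string could be handed to a geocoder such as Geopy's interface to Google's API; that step requires network access and an API key, so it is not shown.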

Next Steps

Through cleaning the tagged locations, we noticed that the Python script has been separating locations that should stay together. Some results come out as [Travis County], [Texas] instead of [Travis County Texas], or even as [County] instead of [Travis County]. Additionally, we noticed that NER misses county names when the word "County" appears in lowercase, so before we run the script again we will fix the capitalization in our input files. It is unlikely that we will ever write a script that catches every location with perfect precision, but we would like to get as close as we can.
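The capitalization fix for the input files could be as simple as the following sketch (assuming the lowercase word "county" is the only case NER is missing):

```python
import re

def fix_county_capitalization(text):
    """Capitalize the standalone word 'county' so NER can catch county names."""
    return re.sub(r"\bcounty\b", "County", text)
```

For example, `fix_county_capitalization("near travis county, Texas")` returns `"near travis County, Texas"`.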

Thus, Aaron is planning to revise the script so that it does not split the county or city name from the state name, perhaps by setting a threshold for the gap between matches in the text before the results are considered distinct entities. Once that is done, he will rerun the advertisement corpora through the new script, and then he and Kaitlyn will begin cleaning the results. We will need to come up with a few parameters or rules for cleaning so that there is consistency across the states. We will also need to decide whether to compare the results for each advertisement to the original text. That could be a very time-consuming process, so we may choose to compare only a subset of the results to the original advertisements, or reduce the overall number of advertisements for which we produce data.
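The gap-threshold revision might look something like this sketch, assuming the tagger's results are available as (text, start, end) character spans sorted by position (the real script's output format may differ):

```python
def merge_adjacent_locations(entities, max_gap=2):
    """Merge location spans separated by at most `max_gap` characters,
    so that 'Travis County' and 'Texas' become one entity instead of two."""
    merged = []
    for text, start, end in entities:
        if merged and start - merged[-1][2] <= max_gap:
            # Close enough to the previous span: treat as one location
            prev_text, prev_start, _ = merged[-1]
            merged[-1] = (prev_text + " " + text, prev_start, end)
        else:
            merged.append((text, start, end))
    return merged
```

With `max_gap=2`, the spans for "Travis County" and "Texas" in "ran away from Travis County, Texas" (separated only by ", ") merge into one entity, while locations mentioned sentences apart stay distinct.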

Even though we will have to reclean the results, we can still use the current cleaned-up results from Texas to start thinking about how we want to visualize our data. Right now, we are planning to look at Palladio to see if it will fit our needs. We have also been thinking about creating a map that shows how many times a state has been referenced in another state's newspapers. Ideally, hovering the cursor over a state would shade that state and the others with an intensity determined by how many times places in each were mentioned in the origin state's ads, but we are still figuring out how to do that. We can start to see how this would work by using the current Texas data in Google Fusion Tables to create a preliminary visualization. Aaron and Kaitlyn will also give Clare feedback on the close reading essay as she continues to revise her draft.

Progress Update on Collecting Information on Arkansas and Mississippi Advertisements

This week, we used the Stanford Named Entity Recognition program to identify the newspapers in the Mississippi ads corpus. We had to break the corpus into several files because the original text was too long to run through NER in one pass. By breaking it into 9 files, we found 12 different newspapers tagged as organizations: the Vicksburg Register, the Natchez Courier and Journal, the Memphis Enquirer, the Louisville Journal, the Cincinnati Gazette, the Port Gibson Correspondent, the Southern Argus, the Alabama Journal, the Woodville Republican and Wilkinson Weekly Advertisor, the Southern Tribune, the Fayette Watch Tower, and the Mississippian State Gazette. We then searched for the number of occurrences of these newspaper titles in the original Mississippi ad corpus. The numbers below are inflated because we still have not figured out how to remove the footnotes: most titles also appeared in footnotes, often too many times to count and remove by hand.

Newspaper Title Number of Occurrences
Vicksburg Register 502
Natchez Courier and Journal 70
Port Gibson Correspondent 158
Southern Argus 39
Woodville Republican and Wilkinson Weekly Advertisor 51
Southern Tribune 2
Fayette Watch Tower 18
Mississippian State Gazette 26
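Occurrence counts like those in the table can be reproduced with a simple case-insensitive search (a sketch; as noted above, raw counts like these still include footnote mentions):

```python
import re

def count_title_occurrences(corpus, titles):
    """Count case-insensitive occurrences of each newspaper title in the corpus."""
    return {t: len(re.findall(re.escape(t), corpus, re.IGNORECASE))
            for t in titles}
```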

The following newspapers were merely mentioned and were not actual newspaper entries: the Memphis Enquirer, the Louisville Journal, the Cincinnati Gazette, and the Alabama Journal. It seems that other newspapers were reprinting advertisements from them. From these numbers, it also appears that the Port Gibson Correspondent has a number of advertisements similar to what we have collected from the Texas Telegraph and Register for the same time period.

During our Wednesday discussion, it was brought up that there are about 150 ads in the Texas Telegraph.  With 244 ads in the Arkansas Gazette, we have two sources relatively similar in size that we can compare.  However, it was also pointed out in class that there are different kinds of ads in these sources.  With the inclusion of notices by jailers or sheriffs, these sources mix different aspects of runaway advertisements.  This raises the analytical issue of whether the two types should be distinguished.  If the sheriff's ads are considered distinctly different, they should be filtered out by identifying keywords like "sheriff" in the text.  However, if both types of ads contain sufficiently similar content for our purposes (descriptive words, location information, whatever we're mining for), then they could remain in the data set, leaving us with a larger sample size.
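The keyword filtering could be sketched like this (the keyword list is a guess at distinguishing terms and would need tuning against the actual ads):

```python
def split_sheriff_notices(ads, keywords=("sheriff", "jailer", "jailor", "committed to")):
    """Separate jailer/sheriff notices from owner-placed runaway ads by keyword."""
    notices, runaway_ads = [], []
    for ad in ads:
        lowered = ad.lower()
        if any(k in lowered for k in keywords):
            notices.append(ad)
        else:
            runaway_ads.append(ad)
    return notices, runaway_ads
```

Keeping both lists, rather than discarding the notices, preserves the option of merging them back in if the two types turn out to be similar enough for our purposes.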

Collecting information about Mississippi and Arkansas Advertisements

Daniel and I have been looking more closely at the advertisements from Arkansas and Mississippi digitized in the Documenting Runaway Slaves project. Using regular expressions, we are cleaning up the text files in TextWrangler to remove unwanted information, such as footnotes, extra dates, and page numbers. Our goal is to find out how many total ads there are for each state, how many ads appear in each particular newspaper, and how many ads fall between the years 1835 and 1860. Below is our progress, divided by state.

Arkansas Advertisements – Daniel

By using regular expressions to search for the individual date headers of ads and separate them into individual text files, we were able to identify 457 separate ads for Arkansas.  Within this set, searching the years narrowed the pool to 324 ads within the range of 1835-1860.
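The splitting step can be sketched as follows; the date-header pattern here (a "day Month year" line) is an assumption, not Daniel's actual expression:

```python
import re

# Assumed shape of a date header: e.g. "12 June 1845." on its own line.
DATE_LINE = re.compile(r"^\d{1,2} [A-Z][a-z]+ 18\d{2}\.?$", re.MULTILINE)

def split_into_ads(corpus):
    """Split the corpus at date-header lines, returning one string per ad."""
    starts = [m.start() for m in DATE_LINE.finditer(corpus)]
    return [corpus[s:e].strip()
            for s, e in zip(starts, starts[1:] + [len(corpus)])]
```

Each returned string begins with its date header, so the per-ad files keep the date that identifies them.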

Uploading the text to Voyant Tools, I used the RezoViz tool to identify the different organizations in the ads.  This gave a strong pointer toward which newspaper titles occur most frequently within the body of ads.  Searching for these in the text in TextWrangler, I was then able to count the occurrences with the "Find All" feature.  This search found 272 occurrences of the Arkansas Gazette.  28 of these were overcounts due to mentions in footnotes (which we were unable to remove from the PDF).  Removing them left an adjusted count of 244 runaway ads in the Arkansas Gazette from 1835-1860.  A similar search revealed the runner-up publications: 35 ads from the Washington Telegraph during this time, and 31 from the Arkansas Advocate.

Mississippi Advertisements – Kaitlyn

First, I removed the extra date headers by using regular expression #1, posted as a gist on my GitHub account. Then, I removed the page numbers by using regular expression #2. That is when I started seeing some issues in how the text copied over from the PDF file I downloaded from the Documenting Runaway Slaves project. As shown in the picture below, I discovered that every time a superscript (such as th, st, or nd) is used, the text does not copy over in the correct order.

As you can see on line 342, the text abruptly cuts off right where the th superscript should be, and the rest of the text that follows has been placed on line 351. The superscript itself has been placed on line 341 (or line 347; both contain "th"). Superscripts for numbers were not used consistently throughout the document, so the problem does not affect every advertisement, but it will pose more of a problem once we start using the advertisements for analysis.
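For the page-number cleanup itself, a stand-in for the kind of expression involved might look like this (an illustrative pattern, not the actual expressions #1 and #2 from the gist):

```python
import re

def strip_page_numbers(text):
    """Remove lines consisting solely of a page number."""
    return re.sub(r"^\s*\d{1,4}\s*$\n?", "", text, flags=re.MULTILINE)
```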

One other problem I discovered with the dates in the [date Month year] format is that some of the lines end in a period, some have no period, and some end with bracketed editorial information. Therefore, I had to use regular expression #3 to figure out how many advertisements the document contained. I found 1633 matches, which is about four times as many as we found for Arkansas. I then used regular expression #4 to figure out how many advertisements we have from the period 1835-1860, and I discovered 1060 matches. There may have been a more effective way to do this, but I think I was able to find them all using that expression.
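Hypothetical equivalents of expressions #3 and #4 (not the actual gist expressions) could allow for the optional trailing period or bracket and restrict the year range like this:

```python
import re

# A "[date Month year]" line, optionally followed by "." or a bracketed note.
AD_DATE = re.compile(r"^\d{1,2} [A-Z][a-z]+ 18\d{2}[.\[]?", re.MULTILINE)
# The same pattern with the year restricted to 1835-1860.
AD_DATE_1835_1860 = re.compile(r"^\d{1,2} [A-Z][a-z]+ 18(?:3[5-9]|[45]\d|60)\b",
                               re.MULTILINE)

def count_ads(corpus, pattern=AD_DATE):
    """Count advertisement date headers matching `pattern`."""
    return sum(1 for _ in pattern.finditer(corpus))
```

The alternation `3[5-9]|[45]\d|60` covers exactly the two-digit year endings 35 through 60.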

I am still working on figuring out how to remove all of the footnotes. The footnotes do not seem to share any features except a number at the beginning of a line, so it is difficult to remove them without also removing advertisement information. Additionally, I will use RezoViz to see how many advertisements we have from each newspaper, as Daniel did with the Arkansas ads; but because there are too many ads collected from Mississippi to analyze them all at once in Voyant, this task is taking longer than I originally thought it would.
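One heuristic we might try for the footnotes (untested against the real corpus, so only a sketch): treat a line that begins with a number as a footnote unless the number is followed by a month name, which would instead mark an advertisement's date header.

```python
import re

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

def looks_like_footnote(line):
    """A line starting with a number is a footnote, unless the number
    is followed by a month name (an ad's date header)."""
    m = re.match(r"\s*\d+\s+(\w+)", line)
    return bool(m) and m.group(1) not in MONTHS

def strip_footnotes(text):
    return "\n".join(l for l in text.splitlines() if not looks_like_footnote(l))
```

A footnote that happens to open with a number followed by a month name would slip through, so the results would need spot-checking.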

Homework #5: Working with Google Maps, Google Earth, and “Time Map” Tools

Over the weekend, I completed the "Intro to Google Maps and Google Earth" tutorial from The Programming Historian. I learned how to import a dataset into a layer on Google Maps. The tutorial used data about UK Global Fat Supply from 1896, and by changing the style of the placemarks, I created a map that colors them by the kind of commodity each region provided.

Additionally, I learned how to create my own placemarks, lines, and polygons (enclosed areas or regions) on Google Maps. Knowing how to create these vector layers could be important for our project because many of our historical questions deal with geography, such as the difference between the slaveholders' "geography of confinement" and the slaves' "rival geography" (for a full list of questions, see our previous post about historical questions).  However, it is more likely that we will create spreadsheets with the data we eventually want to map, such as the location of the slave owner or the possible location the slave ran to. Overall, Google Maps seems like a fairly simple tool for plotting locations or events. One of its main drawbacks, however, is that it can only import the first 100 rows of a dataset and only 3 datasets, for a total of 300 features. Without narrowing down the advertisements, we likely have more data than Google Maps can hold.
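If we do end up with more rows than one layer allows, the dataset could be split into 100-row files before import; a minimal sketch (the file naming here is arbitrary):

```python
import csv

def chunk_csv(rows, header, prefix, max_rows=100):
    """Split a dataset into CSV files of at most `max_rows` data rows each,
    matching Google Maps' per-layer import limit. Returns the file paths."""
    paths = []
    for i in range(0, len(rows), max_rows):
        path = f"{prefix}_{i // max_rows + 1}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)           # repeat the header in each file
            writer.writerows(rows[i:i + max_rows])
        paths.append(path)
    return paths
```

Even so, the 3-dataset cap means chunking only stretches the limit to 300 features, which is another reason to look at other tools.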

The tutorial also let me explore some of the features of Google Earth. Google Earth has the ability to create vector layers like in Google Maps, but it also has more advanced features such as the ability to upload a historical map to overlay over a section of Google Earth.

Map of Canada from 1815 overlaid on Google Earth

Google Earth also has an interesting historical imagery view, which includes a sliding timeline bar that shows what a region looked like at a particular moment in time. Clare and I thought we would be able to add placemarks with time stamps so that they would appear only at certain points in time, and then animate the whole sequence. We tried valiantly to make it work, but the placemarks appeared regardless of which point was selected on the timeline bar. At this point, without finding some sort of tutorial, I do not think we can go much further with animating placemarks in Google Earth.

We do think that the ability to animate points in time would be useful for exploring many of our historical questions. Neatline, a tool built on the online exhibit creator Omeka, would give us that ability. On Wednesday, I would like to take a closer look at what Neatline and TimeMapper (another tool for making "time maps") can do, to see if either is something we might want to pursue. In addition to looking at these time mapping tools during class, I want to look back over the tutorial on thematic data maps to better understand how Google Fusion Tables works. These geographic tools will potentially be useful in analyzing or presenting our data, given the focus on geography in many of our historical questions.