Geography Team Progress Report 4/7

Upon getting back to work and trying to follow our schedule, we realized that we had planned to have too many things due in a very short period of time. We are in the process of adjusting the schedule and replanning what needs to be done. However, the three of us have been working on a few different tasks during the break.

Clare has been working on a draft of the close reading. Her rough draft includes an introduction, an analysis of Arkansas, a discussion of the advantages of the digital, and a conclusion. The analysis of Arkansas will act as a prototype for what she plans to do with the conclusions she is reaching about Mississippi and Texas, although her data collection requires more time than previously realized. She plans on diversifying her advertisement examples, as her current examples come from a few select years. We will discuss suggestions for the progress of the essay with Dr. McDaniel.

Aaron wrote a Python script that tags locations in each advertisement. He ran the cleaned advertisements from the two Texas newspapers and the Mississippi and Arkansas corpora through the script and saved the results as JSON files. I then started to run the tagged locations through GeoNamesMatch, but I quickly ran into difficulties. After discussing with Aaron, we decided that the input and output of that particular program were inconvenient for what we are trying to do. Aaron experimented with Google's free geocoding API (using the Python library Geopy) and had some success with it, so we have decided to use that instead. Aaron and I then started cleaning up the pretty-printed JSON of the tagged locations, and we realized that even though we don't have to correct spelling or expand state abbreviations, this task is going to take a very long time because of the large number of advertisements we have, especially in Mississippi. Our original plan was to compare the output of NER to the actual advertisement (essentially using the NER results as a footing for the actual list of locations), but given the large amount of data and the limited time left in the semester, that may be infeasible.
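The post does not show the layout of those JSON files, but a minimal sketch of what one record per advertisement might look like follows. The field names, ad identifier, and newspaper title here are all hypothetical, chosen only to illustrate the shape of the data:

```python
import json

# Hypothetical structure: one record per advertisement, holding the
# locations the NER script tagged in its text. All names below are
# illustrative, not the team's actual schema.
tagged = [
    {
        "ad_id": "texas-001",
        "newspaper": "Example Texas Newspaper",
        "locations": ["Travis County", "Texas"],
    },
]

# Pretty-print the records the way the files described above are saved,
# then read them back to confirm the round trip.
as_json = json.dumps(tagged, indent=2)
records = json.loads(as_json)
```

A structure like this keeps each advertisement's tagged locations next to its identifier, which makes the later step of comparing NER output against the original advertisement text straightforward.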

Next Steps

Through cleaning the tagged locations, we noticed that the Python script has been separating locations that should be kept together. Some results come out as [Travis County], [Texas] instead of [Travis County Texas], or even as [County] instead of [Travis County]. Additionally, we noticed that NER misses county names when the word "county" appears in lowercase, so before we run the script again we will fix the capitalization in our input files. It is unlikely that we will ever be able to write a script that catches every location with precision, but we would like to get as close as we can.

Thus, Aaron is planning on revising the script so that it does not split up the county or city name from the state name, perhaps by setting a threshold for the gap between each match in the text before the results are considered distinct entities. Once that is done, he will rerun the advertisement corpora through the new script, and then he and Kaitlyn will begin cleaning the results. We will need to come up with a few parameters or rules for cleaning the results so that there is consistency across the states. We will also need to decide whether we should compare the results for each advertisement to the original text. That could be a very time-consuming process, so we may choose to compare a subset of the results to the original advertisements, or reduce the overall number of advertisements for which we will produce data.
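The gap-threshold idea can be sketched in a few lines. This is not Aaron's actual revision, just an illustration of the approach under the assumption that the NER results carry character offsets: if two tagged spans are separated by no more than a couple of characters (say, a comma and a space), treat them as one location:

```python
def merge_nearby(entities, text, max_gap=2):
    """Merge tagged location spans separated by at most max_gap
    characters, so 'Travis County' and 'Texas' in
    'Travis County, Texas' come out as one entity.
    entities: list of (start, end) character offsets, in text order.
    A sketch of the threshold idea, not the team's actual script."""
    if not entities:
        return []
    merged = [entities[0]]
    for start, end in entities[1:]:
        last_start, last_end = merged[-1]
        if start - last_end <= max_gap:
            # Close enough: extend the previous span to cover this one.
            merged[-1] = (last_start, end)
        else:
            merged.append((start, end))
    return [text[s:e] for s, e in merged]
```

The `max_gap` parameter is the tunable threshold: 2 covers ", " between a county and its state, while a larger gap (such as the seventeen characters before "Houston" in the test sentence below) keeps genuinely separate places distinct.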

Even though we will have to re-clean the results, we can still use the current cleaned-up results from Texas to start thinking about how we want to visualize them. Right now, we are planning on looking at Palladio to see if it will fit our needs. We have also been thinking about creating a map that shows how many times a state has been referenced in another state's newspapers. Ideally, we would like to be able to hover over a state with the cursor and have the map shade that state and the others with an intensity determined by the number of mentions of places in each state in the origin state's ads, but we are still figuring out how to do that. We can start to see how this would work by using the current Texas data in Google Fusion Tables to create a preliminary visualization. Aaron and Kaitlyn will also give feedback on the close reading essay to Clare as she continues to revise her draft.
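Whatever tool ends up rendering the map, the table behind it is just a count of (origin state, mentioned state) pairs. A small sketch of how that table could be built from the cleaned results (the data rows here are invented for illustration, not our actual counts):

```python
from collections import Counter

# Hypothetical rows: (state whose newspaper ran the ad, state of the
# place the ad mentions). These values are made up for illustration.
mentions = [
    ("Texas", "Texas"),
    ("Texas", "Louisiana"),
    ("Texas", "Louisiana"),
    ("Mississippi", "Texas"),
]

# Count how often each origin state's ads mention places in each state.
# This is the table a shaded map would read its intensities from.
counts = Counter(mentions)
```

A table in this shape can be exported as CSV for Google Fusion Tables or Palladio, with the count column driving the shading intensity for each state.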

2 Responses to Geography Team Progress Report 4/7

  1. Looks like things are going well, and also that you are hitting up against a perpetual tension in this kind of text mining work. On the one hand, as you note, it may be impossible to write a script that works perfectly to capture all locations, but on the other hand, it is difficult to manually extract all of the data, which points to the need for a script in the first place! The solution may be to simply be reflective and clear about the limits and advantages of what you are doing in the final write-up.

    Is Clare going to be posting her draft in our private Github repo so that you all (and the rest of us) can comment on it?

  2. Never mind my question about Clare’s draft. I see now that it was posted here.