This week, we used the Stanford Named Entity Recognizer (NER) to find the newspapers in the Mississippi ads corpus. Because the original text was too long for the NER to process in a single pass, we split the corpus into 9 files. Across those files, we found 12 newspapers that were tagged as organizations: the Vicksburg Register, the Natchez Courier and Journal, the Memphis Enquirer, the Louisville Journal, the Cincinnati Gazette, the Port Gibson Correspondent, the Southern Argus, the Alabama Journal, the Woodville Republican and Wilkinson Weekly Advertisor, the Southern Tribune, the Fayette Watch Tower, and the Mississippian State Gazette. We then searched for the number of occurrences of each title in the original Mississippi ad corpus. The numbers below are inflated because we have not yet figured out how to remove the footnotes. Most searches returned occurrences in footnotes, and in most cases there were too many to count and remove by hand.
|Newspaper Title||Number of Occurrences|
|Natchez Courier and Journal||70|
|Port Gibson Correspondent||158|
|Woodville Republican and Wilkinson Weekly Advertisor||51|
|Fayette Watch Tower||18|
|Mississippian State Gazette||26|
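The splitting and counting steps described above could be scripted roughly as follows. This is only a minimal sketch, not the procedure we actually ran: the function names `split_into_chunks` and `count_titles` are made up for illustration, and the sketch assumes the corpus has been read into a single plain-text string. It does nothing about the footnote problem, so its counts would be inflated in the same way.

```python
import re


def split_into_chunks(text, n_chunks):
    """Split text into at most n_chunks pieces, breaking only at line ends.

    Hypothetical helper: breaking at line ends keeps a newspaper title on
    one line from being cut in half between two files.
    """
    lines = text.splitlines(keepends=True)
    # Ceiling division, with a floor of 1 so an empty text does not fail.
    size = max(1, -(-len(lines) // n_chunks))
    return ["".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def count_titles(text, titles):
    """Count case-insensitive occurrences of each title in text.

    Whitespace is collapsed first so that a title wrapped across two
    lines is still counted.
    """
    normalized = re.sub(r"\s+", " ", text).lower()
    return {title: normalized.count(title.lower()) for title in titles}
```

Calling `count_titles` on the full corpus with the list of titles found by the NER would return a dictionary of title-to-count pairs comparable to the table above.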
The following newspapers were only mentioned in the corpus and are not themselves sources of the advertisements: the Memphis Enquirer, the Louisville Journal, the Cincinnati Gazette, and the Alabama Journal. It seems that other newspapers were reprinting advertisements from these newspapers. From these numbers, it also seems that the Port Gibson Correspondent contains roughly as many advertisements as we have collected from the Texas Telegraph and Register for the same time period.
During our Wednesday discussion, it was brought up that there are about 150 ads in the Texas Telegraph. With 244 ads in the Arkansas Gazette, we have two sources of relatively similar size that we can compare. However, it was also pointed out in class that these sources contain different kinds of ads. Because they include notices by jailers or sheriffs, they mix different types of runaway advertisements. This raises the analytical question of whether the types should be treated separately. If the sheriffs’ ads are considered distinct, they could be filtered out by identifying keywords like “sheriff” in the text. However, if both types of ads contain sufficiently similar content for our purposes (descriptive words, location information, or whatever else we are mining for), they could remain in the data set, leaving us with a larger sample size.
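If we do decide to separate the two kinds of ads, the keyword filter could look something like the sketch below. It assumes each ad is already available as its own string, which is not yet true of our corpora. The keyword stems and the names `is_jailer_notice` and `partition_ads` are illustrative guesses, not a tested classification scheme.

```python
import re

# Hypothetical keyword stems; "jail" also matches "jailer" and "jailor".
JAILER_KEYWORDS = ("sheriff", "jail")


def is_jailer_notice(ad_text, keywords=JAILER_KEYWORDS):
    """Return True if the ad contains a word starting with any keyword."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")"
    return re.search(pattern, ad_text, flags=re.IGNORECASE) is not None


def partition_ads(ads, keywords=JAILER_KEYWORDS):
    """Split a list of ad texts into (other_ads, jailer_notices)."""
    notices = [ad for ad in ads if is_jailer_notice(ad, keywords)]
    others = [ad for ad in ads if not is_jailer_notice(ad, keywords)]
    return others, notices
```

A filter this simple would misclassify any owner's ad that happens to mention a sheriff or a jail, so the two resulting lists would still need to be spot-checked by hand.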