Over the weekend I wrote up a script called
txparser.py to get our Texas ads out of the Google Drive spreadsheet where we’ve been collecting them. To use the script, I first downloaded each sheet of our spreadsheet into a separate CSV (comma-separated value) file. (This is a text-based spreadsheet file format that can be easily opened in Microsoft Excel, by the way.) The script then iterates over the CSV files and generates a ZIP file containing each transcribed ad in a text file of its own.
The file contains nothing but the transcribed text, while the filename carries a bunch of metadata from the spreadsheet. Here's how the filename is structured:
The first two letters identify this as a Texas ad (which will be important later for distinguishing it from ads from AK and MS). The numbers after the first underscore are the date in YYYYMMDD format. After that comes the name of the newspaper the ad is from (Telegraph), followed by the paper's unique ID in the Portal of Texas History (which is scraped from the permalink URL). Then come more fields scraped from the permalink, including the issue ID, the page number, and the latitude and longitude values of the zoomed ad.
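To make that scheme concrete, here's a sketch of how you might pull the metadata back out of a filename in Python. The underscore-separated layout and every value in the example are my illustration, not necessarily the script's exact format:

```python
def parse_ad_filename(name):
    """Split an ad filename into its metadata fields (hypothetical layout)."""
    stem = name.rsplit(".", 1)[0]  # drop the .txt extension
    state, date, paper, paper_id, issue_id, page, lat, lng = stem.split("_")
    return {
        "state": state, "date": date, "paper": paper, "paper_id": paper_id,
        "issue_id": issue_id, "page": page, "lat": float(lat), "lng": float(lng),
    }

# made-up example values throughout:
meta = parse_ad_filename("TX_18380702_Telegraph_metapth12345_67890_3_0.45_0.62.txt")
print(meta["state"], meta["date"], meta["paper"])  # TX 18380702 Telegraph
```

A helper like this would let you sort or group the text files by newspaper or date without reopening the spreadsheet.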
Starting today, I'm going to be keeping the most recently updated version of our ads as a ZIP file on OWL-Space. (There's still some work to do on our spreadsheets and those at UNT to make sure the ZIP file is complete, but for now you can start using the ZIP file for experimental purposes.) I'll tweet whenever I've posted a new ZIP file to OWL-Space.
Note that if you want to combine all the text files into one text file for some reason, use your Terminal or command prompt to change into the unzipped directory, and then run this command:
cat *.txt > newfilename.txt
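If you're on a machine without cat (plain Windows cmd, for instance), a few lines of Python do the same job. The function name and default output filename here are just placeholders:

```python
import glob
import os

def combine_txt(directory, out_name="combined.txt"):
    """Concatenate every .txt file in `directory` into one file, in sorted order."""
    paths = sorted(glob.glob(os.path.join(directory, "*.txt")))
    out_path = os.path.join(directory, out_name)
    with open(out_path, "w", encoding="utf-8") as out:
        for p in paths:
            if os.path.abspath(p) == os.path.abspath(out_path):
                continue  # don't read the output file into itself on a rerun
            with open(p, encoding="utf-8") as f:
                out.write(f.read())
    return out_path
```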
It's worth understanding what the Python script does, too, because in your projects you may decide that you want to work with only a subset of the ads. The script opens a CSV file and looks for any rows that contain text in the transcription column. It then pulls the relevant information from each such row, writes the transcription to its own text file, and adds that file to the ZIP file.
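In rough outline, that loop looks something like the sketch below. The column name "Transcription" and the counter-based filenames are stand-ins for the real logic, which builds each name from the row's metadata:

```python
import csv
import zipfile

def zip_transcriptions(csv_paths, zip_name, column="Transcription"):
    """Write each non-empty transcription to its own .txt inside a ZIP."""
    count = 0
    with zipfile.ZipFile(zip_name, "w") as zf:
        for csv_path in csv_paths:
            with open(csv_path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    text = (row.get(column) or "").strip()
                    if not text:
                        continue  # skip rows that haven't been transcribed yet
                    # the real script builds the filename from the row's
                    # metadata; a plain counter stands in for that here
                    zf.writestr(f"ad_{count}.txt", text)
                    count += 1
    return count
```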
Given that information, you should be able to figure out how to remove certain ads from the output ZIP file if you wish. First, you could download the spreadsheet from Google Drive as a Microsoft Excel file and manipulate the saved file as desired. (Note: please do not make changes on the Google Drive spreadsheet anymore. If you are going to delete or move cells, download a version of the spreadsheet to your computer first.) You could then export or save the manipulated spreadsheet as a CSV file and run my script on your modified CSVs to generate a ZIP file tailored to your specifications.
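If you'd rather filter in code than in Excel, a small sketch like this could copy just the rows you want into a new CSV, which you'd then run the script on. The "Newspaper" column name in the usage example is an assumption about your spreadsheet's headers:

```python
import csv

def filter_csv(src, dest, keep):
    """Copy only the rows where keep(row) is True from src into dest."""
    with open(src, newline="", encoding="utf-8") as fin:
        reader = csv.DictReader(fin)
        fieldnames = reader.fieldnames
        rows = [row for row in reader if keep(row)]
    with open(dest, "w", newline="", encoding="utf-8") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# e.g. keep only ads from one paper (column name is hypothetical):
# filter_csv("sheet1.csv", "telegraph-only.csv",
#            lambda r: r["Newspaper"] == "Telegraph")
```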
To do all this, of course, you need to have Python installed. You may already have it on your machine, but if not, there are instructions at the Programming Historian for Mac, Linux, and Windows. Next, you need to download the scripts from our GitHub repo. To use the script, first copy it into the same directory containing the CSV files that you want to scrape. Then, at the command line, enter this:

python txparser.py
Note that you can also give the script a preferred filename for the ZIP file as an argument:

python txparser.py jailor-only-ads.zip
This will produce a ZIP file called
jailor-only-ads.zip in the same directory where you run the script. If you don’t give a filename, the default name will be something like
runaway-ads-20140324-132005.zip, where the numbers represent a unique time stamp for when you created the file.
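Those numbers map onto a standard strftime pattern. This sketch rebuilds the example name from a fixed datetime; the real script presumably uses the current time instead:

```python
from datetime import datetime

# A fixed datetime reproduces the example name above; the real script
# would use datetime.now() so each run gets a fresh timestamp.
stamp = datetime(2014, 3, 24, 13, 20, 5)
zip_name = stamp.strftime("runaway-ads-%Y%m%d-%H%M%S.zip")
print(zip_name)  # runaway-ads-20140324-132005.zip
```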
If you have problems with any of this, let me know in the comments! In answering your question, I may also be able to help someone else. Also, if you need a refresher course on how to find your way around in the terminal, revisit the suggested tutorials from Homework #3 or visit The Command Line Crash Course.