Daniel’s question in class on Monday, about whether we were planning to release the ads we have found to the public, reminded me that we had earlier discussed the possibility of tweeting out our transcriptions with a link to the zoomed image in the Portal to Texas History.
This tutorial suggests that this may not be too difficult, especially now that we have a way to get all of our transcriptions out of our spreadsheets and into text files. It would be possible to write a script that reconstructs the URL to the page image from the title of each text file, and then tweets the first several words of the transcription with a link. That could be a way both of sharing our finds and of increasing interest in our larger project.
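A rough sketch of what such a script might do, assuming (hypothetically) that each text file is named after the item's ARK identifier plus a page number, and that the Portal's zoom URL can be rebuilt from those two pieces — the real filename scheme and URL pattern would need checking before use:

```python
import textwrap
from pathlib import Path

# Hypothetical URL pattern and filename scheme, e.g. "metapth123456_page7.txt";
# the Portal to Texas History's actual URLs may be structured differently.
PORTAL_URL = "https://texashistory.unt.edu/ark:/67531/{ark}/{page}/zoom/"

def build_tweet(path, limit=280):
    """Compose a tweet: the opening words of a transcription plus a link."""
    ark, page = Path(path).stem.split("_")       # parse the assumed filename
    url = PORTAL_URL.format(ark=ark, page=page)  # rebuild the page-image URL
    text = Path(path).read_text(encoding="utf-8")
    # Leave room for the URL and a separating space within the tweet limit.
    snippet = textwrap.shorten(text, width=limit - len(url) - 2,
                               placeholder="...")
    return f"{snippet} {url}"
```

Actually posting the result would still require a Twitter API client and credentials; this only shows that composing the tweets from our existing files is straightforward.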
Is this something we would still be interested in doing? Thoughts?
Over the weekend I wrote up a script called txparser.py to get our Texas ads out of the Google Drive spreadsheet where we’ve been collecting them. To use the script, I first downloaded each sheet of our spreadsheet into a separate CSV (comma-separated values) file. (This is a text-based spreadsheet format that can be easily opened in Microsoft Excel, by the way.) The script then iterates over the CSV files and generates a ZIP file containing each transcribed ad in a text file of its own.
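For anyone curious, the core of that workflow can be sketched in a few lines. The column names "filename" and "transcription" here are assumptions for illustration; the real spreadsheet headers may differ:

```python
import csv
import zipfile
from pathlib import Path

def csvs_to_zip(csv_dir, zip_path):
    """Read every CSV in csv_dir and write each row's transcription
    into the ZIP archive as its own text file."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for csv_file in sorted(Path(csv_dir).glob("*.csv")):
            with open(csv_file, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    # Assumed headers -- adjust to the actual spreadsheet.
                    name = row["filename"].strip() + ".txt"
                    zf.writestr(name, row["transcription"])
```

One nice side effect of going through CSV is that the same script works no matter which spreadsheet program exported the sheets.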
Today in class I briefly mentioned TF-IDF (Term Frequency-Inverse Document Frequency) as a possible way for us to identify "give away" words — words that appear unusually often in a particular document compared with the rest of the corpus. Here are some introductory explanations of the method:
And here’s a cool visualization experiment using TF-IDF made by Tim Sherratt, who also made the Real Face of White Australia and Headline Roulette sites shown in class today.
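The arithmetic behind TF-IDF is simple enough to sketch in a few lines of Python. This is a toy version with naive whitespace tokenization; real toolkits add smoothing and better tokenizers:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each word in each document: term frequency (count / doc length)
    times inverse document frequency, log(N / number of docs with the word).
    Words appearing in every document score zero; distinctive words score high."""
    n = len(docs)
    df = Counter()                       # document frequency of each word
    for doc in docs:
        df.update(set(doc.lower().split()))
    scores = []
    for doc in docs:
        words = doc.lower().split()
        counts = Counter(words)
        scores.append({w: (c / len(words)) * math.log(n / df[w])
                       for w, c in counts.items()})
    return scores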
I also mentioned Named Entity Recognition in class; this is the same library used by the Rezo Viz tool that Daniel and Alyssa showed us in their Voyant Tools presentation. It may be possible for us simply to use Voyant as an interface for NER and export a list of place and person names from our ads, but we need to look into this further.
Don’t forget that we will be having our first meeting of the semester this Friday at 5:30 in Keck Hall, Room 101 (the building with Valhalla). Dinner will be provided for everyone at the beginning.
Our guests for the workshop this Friday afternoon are Natalie Houston and Neal Audenaert. They have received a start-up grant from the NEH Office of Digital Humanities to build a program they are calling The Visual Page.
In my posts on learning Python (Part I and Part II), I’ve been trying to create a script that takes one of my old blog posts and turns it into a correctly formatted Markdown file using a program called Pandoc. In plainer English, I’m trying to take some stuff that looks like this and make it look like this.
The ultimate goal is to take a list of URLs from my blog and make an EPUB out of all the content. Getting each page into a Markdown file is an intermediate step. My last post figured out how to do this with at least one page, but I’ve now figured out how to take a list of URLs, convert each of them to markdown, and append the resulting text all together in one file.
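That fetch-convert-append loop can be sketched as follows. This assumes Pandoc is installed (it shells out to `pandoc -f html -t markdown`); the `convert` parameter is my own addition here so a different converter could be swapped in:

```python
import subprocess
import urllib.request

def pages_to_markdown(urls, out_path, convert=None):
    """Fetch each URL, convert its HTML to Markdown, and append
    all the results to a single output file."""
    if convert is None:
        # Default converter: pipe the HTML through Pandoc.
        convert = lambda html: subprocess.run(
            ["pandoc", "-f", "html", "-t", "markdown"],
            input=html, capture_output=True, text=True, check=True).stdout
    with open(out_path, "w", encoding="utf-8") as out:
        for url in urls:
            html = urllib.request.urlopen(url).read().decode("utf-8")
            out.write(convert(html) + "\n\n")
```

From there, Pandoc itself can turn the combined Markdown file into an EPUB.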
In my last Python post, I learned how to get a single webpage from one of my old blogs and convert it from HTML into Markdown. My objective, if you recall, is to take a list of posts from my old blog and convert them into an EPUB.
I chose this task mainly to give me a reasonable goal while learning Python, but I’m also thinking about some of the practical uses for a script like this. For instance, say you had a list of webpages containing primary source transcriptions that you wanted your students to read. A script like the one I’m trying to write could conceivably be used to package all of those sources in a single PDF or EPUB file that could then be distributed to students. The popularity of plugins like Anthologize also indicates that there is a general interest in converting blogs into electronic books, but that plugin only works with WordPress. A Python script could conceivably do this for any blog.
I’m quickly learning, however, that making this script portable will require quite a bit of tweaking. Which is another way of saying "quite a bit of geeky fun"!
This semester I have been trying to learn a little bit about Python, an open-source programming language. Python is the language used in the introductory undergraduate course for Computer Science here at Rice (and it’s actually the basis for Rice’s first Coursera offering). Fortunately, however, there is an even more introductory course on Python aimed at historians. It’s called The Programming Historian, and it’s what I’m using to get started with the language.
The Programming Historian offers lessons and example code that can be immediately useful. Indeed, we heard from Scott Nesbit earlier this semester that he used the lesson on Automated Downloading with Wget to download the Official Records webpages used for the Visualizing Emancipation project. The Programming Historian is also built on the assumption that the best way to learn programming is to do it. History graduate student Jason Heppler puts the point this way in his excellent essay How I Learned to Code:
How do I continue to learn? I simply dig in. Computers are best learned not through books or lecture, but by hands-on experience. … I learn and I write. I trace others’ code to see what each line of code does and how everything fits together.
I decided that if I was going to learn Python, I was going to have to "just dig in" too. So I came up with something I wanted to do, and I’ve set out to see if I can do it in Python.