Today in class I briefly mentioned TF-IDF (Term Frequency-Inverse Document Frequency) as a possible way for us to identify "give away" words that might appear more frequently in a particular document. Here are some introductory explanations of the method:
And here’s a cool visualization experiment using TF-IDF made by Tim Sherratt, who also made the Real Face of White Australia and Headline Roulette sites shown in class today.
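To make the TF-IDF idea a bit more concrete, here is a minimal sketch in Python using only the standard library. It is an illustration rather than the method used by any of the tools mentioned above: the function name `tf_idf`, the whitespace tokenizer, and the three sample "ads" are all invented for this example, and real libraries usually apply smoothing and normalization that this version leaves out.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document.

    Term frequency is the word's share of the document's tokens;
    inverse document frequency is log(N / number of docs containing the word).
    """
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Count how many documents each word appears in at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical sample texts, made up for illustration only.
ads = [
    "reward offered for a horse lost near houston",
    "reward offered for a watch lost near galveston",
    "reward offered for a dog lost near austin",
]

scores = tf_idf(ads)
for ad_scores in scores:
    # Show the three highest-scoring words for each document.
    top = sorted(ad_scores, key=ad_scores.get, reverse=True)[:3]
    print(top)
```

Words that appear in every document (such as "reward" here) get a score of zero, while words unique to one document float to the top, which is the "give away" behavior described above.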
I also mentioned Named Entity Recognition in class; this is the same library used by the Rezo Viz tool that Daniel and Alyssa showed us in their Voyant Tools presentation. It may be possible for us simply to use Voyant as an interface for NER and export a list of place and person names from our ads, but we need to look into this further.
Don’t forget that we will be having our first meeting of the semester this Friday at 5:30 in Keck Hall, Room 101 (the building with Valhalla). Dinner will be provided for everyone at the beginning.
Our guests for the workshop this Friday afternoon are Natalie Houston and Neal Audenaert. They have received a start-up grant from the NEH Office of Digital Humanities to build a program they are calling The Visual Page.
In my posts on learning Python (Part I and Part II), I’ve been trying to create a script that takes one of my old blog posts and turns it into a correctly formatted markdown file using a program called Pandoc. In plainer English, I’m trying to take some stuff that looks like this and make it look like this.
The ultimate goal is to take a list of URLs from my blog and make an EPUB out of all the content. Getting each page into a Markdown file is an intermediate step. My last post figured out how to do this with at least one page, but I’ve now figured out how to take a list of URLs, convert each of them to markdown, and append the resulting text all together in one file.
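The loop described above might look roughly like the following sketch. This is not the script from the posts themselves; it assumes Python 3 and that the `pandoc` command is installed and on the PATH, and the names `fetch_html`, `html_to_markdown`, and `combine_posts` are invented for this illustration.

```python
import subprocess
import urllib.request


def fetch_html(url):
    """Download a page and return its HTML as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def html_to_markdown(html):
    """Pipe HTML through the pandoc command-line tool (assumed to be installed)."""
    result = subprocess.run(
        ["pandoc", "--from=html", "--to=markdown"],
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def combine_posts(urls, output_path, fetch=fetch_html, convert=html_to_markdown):
    """Convert each URL to Markdown and append the results to one file."""
    with open(output_path, "w", encoding="utf-8") as out:
        for url in urls:
            out.write(convert(fetch(url)))
            # Blank line so one post does not run into the next.
            out.write("\n\n")
```

Calling `combine_posts(list_of_urls, "all_posts.md")` with a real list of post URLs would then produce a single Markdown file. The `fetch` and `convert` parameters are there so that either step can be swapped out, for example to strip sidebars from the HTML before conversion.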
In my last Python post, I learned how to get a single webpage from one of my old blogs and convert it from HTML into Markdown. My objective, if you recall, is to take a list of posts from my old blog and convert them into an EPUB.
I chose this task mainly to give me a reasonable goal while learning Python, but I’m also thinking about some of the practical uses for a script like this. For instance, say you had a list of webpages containing primary source transcriptions that you wanted your students to read. A script like the one I’m trying to write could conceivably be used to package all of those sources in a single PDF or EPUB file that could then be distributed to students. The popularity of plugins like Anthologize also indicates that there is a general interest in converting blogs into electronic books, but that plugin only works with WordPress. A Python script could conceivably do this for any blog.
I’m quickly learning, however, that making this script portable will require quite a bit of tweaking. Which is another way of saying "quite a bit of geeky fun"!
This semester I have been trying to learn a little bit about Python, an open-source programming language. Python is the language used in the introductory undergraduate course for Computer Science here at Rice (and it’s actually the basis for Rice’s first Coursera offering). Fortunately, however, there is an even more introductory course on Python aimed at historians. It’s called The Programming Historian, and it’s what I’m using to get started with the language.
The Programming Historian offers lessons and example code that can be immediately useful. Indeed, we heard from Scott Nesbit earlier this semester that he used the lesson on Automated Downloading with Wget to download the Official Records webpages used for the Visualizing Emancipation project. The Programming Historian is also built on the assumption that the best way to learn programming is to do it. History graduate student Jason Heppler puts the point this way in his excellent essay How I Learned to Code:
How do I continue to learn? I simply dig in. Computers are best learned not through books or lecture, but by hands-on experience. … I learn and I write. I trace other’s code to see what each line of code does and how everything fits together.
I decided that if I was going to learn Python, I was going to have to "just dig in" too. So I came up with something I wanted to do, and I’ve set out to see if I can do it in Python.