This semester I have been trying to learn a little bit about Python, an open-source programming language. Python is the language used in the introductory undergraduate course for Computer Science here at Rice (and it’s actually the basis for Rice’s first Coursera offering). Fortunately, however, there is an even more introductory course on Python aimed at historians. It’s called The Programming Historian, and it’s what I’m using to get started with the language.
The Programming Historian offers lessons and example code that can be immediately useful. Indeed, we heard from Scott Nesbit earlier this semester that he used the lesson on Automated Downloading with Wget to download the Official Records webpages used for the Visualizing Emancipation project. The Programming Historian is also built on the assumption that the best way to learn programming is to do it. History graduate student Jason Heppler puts the point this way in his excellent essay How I Learned to Code:
How do I continue to learn? I simply dig in. Computers are best learned not through books or lecture, but by hands-on experience. … I learn and I write. I trace other’s code to see what each line of code does and how everything fits together.
I decided that if I was going to learn Python, I was going to have to "just dig in" too. So I came up with something I wanted to do, and I’ve set out to see if I can do it in Python.
My goal is to see if I can transform some of my favorite posts from my old blog (which I began in graduate school) into an EPUB (or e-book) that can be read on mobile reading devices. I already know that I could do this just using Pandoc, a text conversion tool that I used to write my book in plain text. That told me that this is an achievable goal, which is a good one to have when learning a new language.
I also knew, from the PH lesson on Working with Files and Web Pages, that it would be easy to grab a webpage from my blog and save it to a file. In fact, this first step in my project was as easy as changing the URL in some of the code already provided by the PH people:
    # save-webpage.py
    import urllib2

    url = 'http://www.oldbaileyonline.org/print.jsp?div=t17800628-33'

    response = urllib2.urlopen(url)
    webContent = response.read()

    f = open('obo-t17800628-33.html', 'w')
    f.write(webContent)
    f.close()  # the parentheses are needed to actually call close()
That code "imports" a Python module called "urllib2," and then uses some of the functions provided by that module to open a URL and save it to a local html file. I figured out that I could change the URL and file name and do the same thing to a page on one of my own blogs, like this:
    import urllib2

    url = 'http://mcdaniel.blogs.rice.edu/?p=158'

    response = urllib2.urlopen(url)
    webContent = response.read()

    f = open('wendell-phillips.html','w')
    f.write(webContent)
    f.close()
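A note for anyone following along in Python 3: the scripts in this post are Python 2, where the module is called urllib2. In Python 3 the same functions live in urllib.request, and read() returns bytes rather than text. Here is a minimal sketch of the same download step under those assumptions. The helper name save_webpage is made up for this illustration and does not come from The Programming Historian.

```python
# Python 3 sketch of the download step (the post itself uses Python 2's urllib2).
# save_webpage is a hypothetical helper name chosen for this example.
import urllib.request


def save_webpage(url, filename):
    """Fetch url and write the raw bytes of the response to filename."""
    with urllib.request.urlopen(url) as response:
        web_content = response.read()  # bytes in Python 3
    # 'wb' because web_content is bytes; the with-block closes the file for us
    with open(filename, 'wb') as f:
        f.write(web_content)
    return len(web_content)  # number of bytes saved
```

Calling save_webpage('http://mcdaniel.blogs.rice.edu/?p=158', 'wendell-phillips.html') should then do the same job as the script above, and the with-blocks mean there is no close() to forget.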
So far so good. But now I wanted to know if Python would allow me to use Pandoc to convert the webpage into another format before saving it to a file. So I turned to Google. (Sidebar: when learning to do anything with your computer, Google is your friend. Don’t be afraid to use it!) A quick search for "python pandoc" turned up this page on pyandoc, which looked like exactly what I needed.
I also noted, in the "Get Setup" section, that using pyandoc involved an import command just like the import urllib2 command in the Programming Historian code. Somehow I needed to know how to get from downloading the pyandoc package to being able to import it into my program. So I did another Google search for "install python package," and came up with this page from the Python documentation. Following the directions, I ran python setup.py install in my Terminal from within the downloaded directory for pyandoc, and it seemed to work.
Now I set about trying to see if I could import pandoc in my script and transform my webContent variable into another format before saving it to a file. I knew that Pandoc could transform my single page directly into EPUB, but I also knew (from experience working with Pandoc) that doing this directly would result in a lot of unwanted content that is in the HTML for my webpage being put into the final EPUB. Plus, I want several blog posts to be in the EPUB, not just one. I’m figuring that the easiest way to make sure the EPUB will look the way I want is to convert the HTML for each blog post first to Markdown, a lightweight markup language. (See Lincoln Mullen’s introduction on Profhacker.) Then I can clean up the Markdown and convert all of the posts together into a single EPUB.
This meant that the first step should have been simple: use Pandoc to convert HTML to Markdown within my Python script. Getting to that point was not self-explanatory, however. I first tried to follow the usage examples on the pyandoc page. I could see that these usage examples fed a multi-line Markdown string into something called doc.markdown. (I knew about multi-line strings, by the way, from the lessons on Programming Historian.) What I wanted to see was if I could take the HTML stored in my webContent variable (which is itself just a symbol for a multi-line string) and feed it into doc.html, changing "markdown" to "html" to reflect the correct format.
Eventually, I got this to work with a script that looks like this:
    # pandoc-webpage.py
    # Requires: pyandoc http://pypi.python.org/pypi/pyandoc/
    # (Change path to pandoc binary in core.py before installing)
    import urllib2
    import pandoc

    # Open the desired webpage
    url = 'http://mcdaniel.blogs.rice.edu/?p=158'
    response = urllib2.urlopen(url)
    webContent = response.read()

    # Call on pandoc to convert webContent to markdown and save
    doc = pandoc.Document()
    doc.html = webContent
    webConverted = doc.markdown
    f = open('wendell-phillips.txt','w')
    f.write(webConverted)
    f.close()
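Judging from the note about the path to the pandoc binary, pyandoc appears to work by running the pandoc command-line program. If the module gives trouble, one alternative is to call pandoc directly with Python’s standard subprocess module. The sketch below is Python 3 and assumes a pandoc binary is available on the PATH. The helper names pandoc_command and convert_text are invented for illustration and are not part of pyandoc.

```python
# Sketch only: calls the pandoc command-line tool directly instead of pyandoc.
# Assumes Python 3.7+ and that a 'pandoc' executable can be found on the PATH.
import subprocess


def pandoc_command(from_format, to_format, executable='pandoc'):
    """Build the argument list for a pandoc conversion that reads from stdin."""
    return [executable, '--from', from_format, '--to', to_format]


def convert_text(text, from_format='html', to_format='markdown'):
    """Pipe text through pandoc and return the converted output as a string."""
    result = subprocess.run(
        pandoc_command(from_format, to_format),
        input=text,           # sent to pandoc on standard input
        capture_output=True,  # collect pandoc's output instead of printing it
        text=True,            # work with str rather than bytes
        check=True,           # raise CalledProcessError if pandoc fails
    )
    return result.stdout
```

With this in place, convert_text(some_html) would play the role of the doc.html and doc.markdown lines, provided some_html is a string rather than bytes.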
I didn’t get to this working solution right away, and that’s important to note. I tried lots of different things in the code, and when I tried to run my program, I kept getting error messages that didn’t make a lot of sense to me. But I took courage from Stephen Ramsay’s observation that to learn programming requires being okay with error messages. And I also noticed that the error messages changed depending on what I changed in my code.
Sometimes the error messages said something along the lines of "pandoc.Document()" only takes one argument, and you’ve given it two. That made me think I just needed to learn more about the pyandoc module and what "pandoc.Document()" means. Other times the error messages made it look like Python was having trouble finding or running the pyandoc module I had installed. This second error message was confounding enough to send me to Google, which eventually led me to someone else with the same problem. Fortunately, that person had already figured out the solution, which required me to change something in the pyandoc code in a text editor and then reinstall the package.
The other error about the correct number of arguments to use required me to figure out the solution on my own. Basically I tried different ways of arranging these three lines:
    doc = pandoc.Document()
    doc.html = webContent
    webConverted = doc.markdown
I tried putting webContent in the parentheses, for example, and that didn’t work. I tried webContent, which also didn’t work. When I finally figured out what did work (the above lines), I then set myself to understanding what was happening. The line beginning with doc was basically calling some code from the pyandoc module and getting it ready to receive the input I wanted to convert. The doc.html line told the module that I was going to be feeding it "html" input (namely, the content of the webContent variable). I then made a new variable webConverted that contained the contents of doc converted into markdown. The next step was just a matter of saving this variable, webConverted, to my local file, instead of webContent as in the original code.
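Both the pattern in those three lines and the "takes one argument" error can be reproduced with a toy class. This is not pyandoc’s actual code: ToyDocument and its "shouted" output are made up purely to show the shape of the interface, in Python 3. The object is created with empty parentheses, input is assigned to one attribute, and converted output is read from another.

```python
class ToyDocument:
    """Hypothetical stand-in for pandoc.Document; it does no real conversion."""

    def __init__(self):
        # Only the implicit 'self' argument is accepted, so the class must be
        # called with empty parentheses: ToyDocument()
        self._source = ''

    @property
    def html(self):
        return self._source

    @html.setter
    def html(self, value):
        # Runs when you write: doc.html = some_string
        self._source = value

    @property
    def shouted(self):
        # Runs when you read doc.shouted; a real library would convert here
        return self._source.upper()


doc = ToyDocument()
doc.html = '<p>hello</p>'
converted = doc.shouted

try:
    ToyDocument('<p>hello</p>')  # passing the input inside the parentheses
except TypeError as error:
    # Python counts 'self' plus our string, hence "2 were given"
    message = str(error)
```

The error message counts two arguments even though only one appears in the parentheses, because Python passes the new object itself as the first argument behind the scenes.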
Before finishing up with this first step, I made some notes to myself about what I want to do next:
    # TODO: Iterate over a list of webpages
    # TODO: Clean up HTML by removing hard linebreaks
    # TODO: Delete header and footer
And I also posted my results to Twitter, asking for advice from none other than this Thursday’s guest speaker, Chad Black. He advised me thusly:
@wcaleb Looks fine. I would make it name the file w/title of the web page by finding the <title> and </title> tags and slicing.
— Chad Black (@parezcoydigo) October 29, 2012
That sounded like a good idea, but I wasn’t sure what he meant by slicing. Guess where I’m turning next …
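For readers wondering the same thing: in Python, "slicing" means taking part of a string with square brackets, as in text[start:end]. One way to read the suggestion, sketched here in Python 3, is to find where the <title> and </title> tags sit and slice out whatever lies between them. The helper name extract_title and the sample HTML are made up for illustration, and the sketch assumes lowercase title tags with no attributes.

```python
def extract_title(html):
    """Return the text between <title> and </title>, or None if not found."""
    open_tag, close_tag = '<title>', '</title>'
    start = html.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)             # move past the opening tag itself
    end = html.find(close_tag, start)  # look for the closing tag after it
    if end == -1:
        return None
    # The slice: everything from start up to (not including) end
    return html[start:end].strip()


# Invented sample page, standing in for the downloaded webContent
sample = '<html><head><title> Wendell Phillips </title></head><body></body></html>'
title = extract_title(sample)
filename = title.lower().replace(' ', '-') + '.txt'
```

A real page title may contain characters that are awkward in file names, so a fuller version would probably need to clean the title up further before using it.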