In my posts on learning Python (Part I and Part II), I’ve been trying to create a script that takes one of my old blog posts and turns it into a correctly formatted Markdown file using a program called Pandoc. In plainer English, I’m trying to take some stuff that looks like this and make it look like this.
The ultimate goal is to take a list of URLs from my blog and make an EPUB out of all the content. Getting each page into a Markdown file is an intermediate step. In my last post I figured out how to do this for a single page, but I’ve now figured out how to take a list of URLs, convert each of them to Markdown, and append the resulting text to a single file.
Most of what I needed to learn to take this step came from two Programming Historian lessons. “Working with Files and Web Pages” taught me how to open and read the contents of a text file using Python, and also how to append text to a text file. “From HTML to a List of Words” taught me about simple “looping” in Python (which is what I needed to step through a list of URLs and do something to each of them).
That lesson also discussed some String Methods and included a link to more on the Python webpage. One of these methods, I learned, would allow me to break a string into a list (another concept discussed in the “HTML to Words” lesson) with each line as an item.
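As a small illustration of that idea: the post does not name the string method, but `str.splitlines()` is one built-in that fits the description. This is a sketch in Python 3 with made-up example.com URLs, not the script's actual code.

```python
# A multi-line string standing in for the contents of a file like urls.txt.
# The example.com addresses are placeholders, not real blog posts.
text = "http://example.com/one.html\nhttp://example.com/two.html\n"

# splitlines() breaks the string at line boundaries and drops the newlines,
# giving a list with one item per line.
lines = text.splitlines()

for line in lines:
    print(line)
```

Calling `text.split("\n")` would do nearly the same thing, but it leaves an empty string at the end of the list when the text ends with a newline.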
These concepts gave me what I needed in order to …
- Open a text file called urls.txt containing the URL for each blog post I want to convert on a separate line.
- Using a loop, convert the pages at each of those URLs to Markdown using the code from my last post, appending the converted text to a single text file on my computer.
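Those two steps can be sketched roughly as follows. This is Python 3 and only an outline: the real conversion code lives in the earlier post, so `placeholder_convert` here is a hypothetical stand-in, and the function names `read_urls` and `convert_all` are my own inventions for illustration.

```python
def read_urls(path):
    """Return the non-empty lines of a text file as a list of URLs."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f.read().splitlines() if line.strip()]


def convert_all(urls, convert, out_path):
    """Run convert(url) on each URL and append the result to out_path."""
    for url in urls:
        markdown = convert(url)
        # Mode "a" appends, so each post lands after the previous one.
        with open(out_path, "a", encoding="utf-8") as out:
            out.write(markdown + "\n\n")


def placeholder_convert(url):
    """Hypothetical stand-in for the real HTML-to-Markdown step."""
    return "# Post from " + url
```

With a urls.txt file in place, something like `convert_all(read_urls("urls.txt"), placeholder_convert, "calebpost.txt")` would run the whole loop. Because the output file is opened in append mode, running it twice adds the posts twice, so the output file would need to be deleted between runs.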
The result is pretty cool. First, I create a text file called urls.txt that contains nothing but the URLs that I want to convert:
http://modeforcaleb.blogspot.com/2005/08/first-twenty-minutes.html
http://modeforcaleb.blogspot.com/2004/12/lives-of-douglass-part-i.html
http://modeforcaleb.blogspot.com/2004/12/lives-of-douglass-part-ii.html
Then I run my script, now modified as described above. The script creates a new file (called calebpost.txt for now) that contains all three of those webpages in nicely formatted Markdown. It looks like this.
At this point, I’m almost all the way to my original goal of making an EPUB. Now that Python and Beautiful Soup have done the hard work of grabbing the HTML from the web and cleaning it up, I could technically stop and use Pandoc directly on the calebpost.txt file to create an EPUB. In fact, I did just that and it worked pretty well. Here’s a screenshot of one of the finished ebook’s pages on my iPad:
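For anyone who wants to try the same Pandoc step from inside Python rather than at the command line, here is a minimal sketch. It assumes Pandoc is installed and on the PATH; the output name calebpost.epub and both function names are my own choices for illustration, not something from the original script.

```python
import subprocess


def pandoc_epub_command(markdown_path, epub_path):
    """Build the argument list for converting a Markdown file to EPUB."""
    # -f markdown names the input format; -o names the output file,
    # and Pandoc infers EPUB from the .epub extension.
    return ["pandoc", markdown_path, "-f", "markdown", "-o", epub_path]


def make_epub(markdown_path, epub_path):
    """Run Pandoc; raises CalledProcessError if Pandoc reports a failure."""
    subprocess.run(pandoc_epub_command(markdown_path, epub_path), check=True)
```

Calling `make_epub("calebpost.txt", "calebpost.epub")` is then equivalent to typing `pandoc calebpost.txt -f markdown -o calebpost.epub` in a terminal.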
But I’m also going to keep plugging away at this script to see if I can go from start to finish using Python. I’m also beginning to think about ways I could tweak the script so that it would be more useful in a variety of use cases. For example, imagine if you could make a text file that provided a sort of table of contents for your EPUB. So long as the user provided the URLs and some clues about where Beautiful Soup should look for the main content, I wonder if my script (when modified) could take an input file like this:
% Course Readings
% W. Caleb McDaniel
% November 20, 2012

Sam Wineburg, "Goodbye, Columbus"
http://www.smithsonianmag.com/history-archaeology/presence-famous.html

Laurel Thatcher Ulrich, "How Betsy Ross Became Famous"
http://www.common-place.org/vol-08/no-01/ulrich/
… and then make an EPUB that could be distributed to students.
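Reading an input file like the one above might look something like this. It is only a guess at a design, assuming each `%` line, each citation, and each URL sits on its own line; the function name `parse_reading_list` is hypothetical, and the sketch does nothing yet about telling Beautiful Soup where the main content is.

```python
def parse_reading_list(text):
    """Split an input file into metadata lines and (label, url) pairs."""
    metadata = []
    readings = []
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        if line.startswith("%"):
            # Pandoc-style title block: title, author, date.
            metadata.append(line.lstrip("%").strip())
        elif line.startswith("http://") or line.startswith("https://"):
            # A URL closes the entry begun by the most recent label line.
            readings.append((label, line))
            label = None
        else:
            label = line
    return metadata, readings


sample = """% Course Readings
% W. Caleb McDaniel
% November 20, 2012

Sam Wineburg, "Goodbye, Columbus"
http://www.smithsonianmag.com/history-archaeology/presence-famous.html

Laurel Thatcher Ulrich, "How Betsy Ross Became Famous"
http://www.common-place.org/vol-08/no-01/ulrich/
"""

metadata, readings = parse_reading_list(sample)
```

Each pair in `readings` could then be handed to the conversion loop, with the label used as a chapter heading in the Markdown that goes to Pandoc.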
The trick would be to be able to tell the script where to look in the HTML on each page for the appropriate content, but such a trick would be useful …