As indicated on the syllabus, your first Progress Report on our class project is due this Monday, March 17, by the end of class. The progress report should take the form of a correctly formatted, hyperlink-rich post to this blog. Each group needs to make only one post, but you should work together on the post and will be assigned a grade on the report as a group. Note that the report needs to show your progress, even if you haven’t yet completed all the tasks assigned to you. The groups/tasks we assigned last Monday are as follows, but keep in mind that groups and tasks will shift as we move forward.
These last few days, Franco and I have been developing a way to detect runaway slave ads in images of nineteenth-century newspapers. The Portal to Texas History has digitized copies of thousands of issues of Texas newspapers and is a source waiting to be explored for runaway slave ads. For example, a search for “runaway negro” in the full text (OCR transcriptions) of their collection yields 7,159(!) results. Clearly, that number is too high for manual perusal of all possible matches.
Thus, we have been thinking about ways to automate the process. At the suggestion of Dr. McDaniel, we decided to use OpenCV, a popular open-source computer vision library, to conduct object recognition for the classic runaway slave icon. You know, this one:
(In newspapers, from what I have seen, it usually appeared much smaller and simplified, as shown here).
OpenCV has a tool called Cascade Classifier Training that builds an XML file that can be used to detect objects. It requires a set of positive samples (images that contain the chosen object) and negative samples (images that do not contain the object but are of similar context). It works best with a large dataset of positive samples, and to help generate one it provides a utility called opencv_createsamples that takes an image and applies transformations to it, such as adjustments in intensity, rotations, and color inversions, to make altered versions. Once the cascade has been trained, it can be used to efficiently detect and locate the desired object in other images.
So, the first order of business in preparing to do object recognition was to collect a set of runaway slave icons. I downloaded ~35 newspaper page images containing the icon and cropped each one so that only the icon was visible. The tutorials [1, 2, 3 ..others] I read suggested that for best results the positive images (images of the object to be detected) should all have the same aspect ratio. For simplicity, I made sure all my images were 60×64px.
Next I generated a set of negative (background) images from newspaper pages that did not have the runaway icon. These had to be the same size as the positive images. I read that a large data set was especially needed for the negatives, so I wrote a simple script to crop newspaper page images into a series of individual 60×64 pics. For anyone curious, here’s a gist of the code. A typical image looked something like this.
After running the script on several images, I ended up with ~1600 negative images to use in training the cascade classifier. I supplemented that with some manually-cropped pics of common icons such as the one that appears to the left.
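For anyone who doesn’t want to open the gist, the tiling logic behind a cropping script like this can be sketched in pure Python. This is the geometry only, under the assumption that pages are cut into a grid of fixed-size tiles; in the actual script each box would be handed to an imaging library (e.g. PIL’s Image.crop) and saved out:

```python
# Sketch of the tiling step: compute crop boxes that cut a newspaper
# page into 60x64 tiles. Boxes are (left, upper, right, lower), the
# convention PIL's Image.crop uses. Partial tiles at the right and
# bottom edges are discarded.

def tile_boxes(page_w, page_h, tile_w=60, tile_h=64):
    boxes = []
    for top in range(0, page_h - tile_h + 1, tile_h):
        for left in range(0, page_w - tile_w + 1, tile_w):
            boxes.append((left, top, left + tile_w, top + tile_h))
    return boxes

# A hypothetical 600x640 page yields a 10x10 grid of negatives.
print(len(tile_boxes(600, 640)))  # → 100
```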
Next I used the find command in Terminal to output text files containing a list of all the positive and all the negative images. Then I created the “sample,” a binary file containing all the positive images, which is required by the cascade trainer (opencv_traincascade). As I mentioned, transformation settings are usually specified when creating the sample to multiply the amount of data available to train the cascade. I figured that the runaway icon would always appear upright, and I made sure my positive image set contained icons of varying clarity, so I just ran opencv_createsamples without any distortions.
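The file-list step can also be done from Python’s standard library instead of find. In this sketch the folder name and glob pattern are placeholders, not the actual layout I used; the resulting one-path-per-line files are what opencv_createsamples and the -bg flag of opencv_traincascade consume:

```python
import glob

def write_image_list(folder, outfile, pattern="*.png"):
    """Write one image path per line, the list format the OpenCV
    cascade tools expect (equivalent to `find folder -name '*.png'`)."""
    paths = sorted(glob.glob(f"{folder}/{pattern}"))
    with open(outfile, "w") as f:
        f.write("\n".join(paths) + "\n")
    return paths
```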
Finally, I had all I needed to train the cascade. I ran the following command in Terminal:
opencv_traincascade -data classifier -vec samples/samples.vec -bg negatives.txt -numStages 6 -minHitRate 0.95 -numPos 27 -numNeg 1613 -w 60 -h 64 -precalcValBufSize 512 -precalcIdxBufSize 256
opencv_traincascade is the program to be run. The value for data is the name of the folder in which to store the resulting cascade file. The value for vec is the path to the samples vector file. The value for bg is the name of the file containing paths to each negative image. I am not entirely sure about numStages, so I just picked 6, since I didn’t want the training to run for days as others have experienced. minHitRate dictates the accuracy. numPos I still don’t quite understand, but I chose ~80% of the number of positive images to ensure no errors would result. numNeg is the number of negative images. Then there are width, height, and some settings specifying how much RAM the program can hog.
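One note on numPos: a rule of thumb that circulates in OpenCV community discussions (not official documentation, so treat it as an assumption) says each stage after the first can consume up to (1 - minHitRate) * numPos extra positives from the .vec file, which suggests a quick sanity check:

```python
# Community rule of thumb (an assumption, not documented behavior):
# pick numPos small enough that the trainer can't exhaust the .vec
# file as stages discard hard positives.

def safe_num_pos(vec_count, num_stages, min_hit_rate):
    return int(vec_count / (1 + (num_stages - 1) * (1 - min_hit_rate)))

# With ~34 positives, 6 stages, and a 0.95 hit rate, this lands right
# around the ~80% figure I used above.
print(safe_num_pos(34, 6, 0.95))  # → 27
```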
I had high hopes, but after 30 minutes of fans-blaring CPU use the program quit with the message, “Required leaf false alarm rate achieved. Branch training terminated.” I need to do more research to figure out why it didn’t work, but an initial search suggested that the number of positive samples I used may not be enough. Joy.
- Play around with OpenCV some more to try to get a functional cascade. Maybe enlist the help of Stack Overflow or Reddit.
- Rethink whether object recognition is the best way to maximize runaway slave ad discovery. While a lot of ads did use the icon, perhaps a larger number did not. For newspapers with digital transcriptions, text-based analysis would surely provide better results.
- If we can’t get a working cascade to do object recognition, revisit newspaper decomposition. Franco and I tried using Hough Line Transforms in OpenCV to detect the lines separating newspaper articles, but to no avail. The technique’s promise is marked-up images like the Sudoku board shown below; to the right of it is our “success.” The theory is that if we could detect the dividing lines in newspapers, we could crop the pages into individual articles, run OCR on each article, and then do text analysis to discover runaway ads. It is no easy feat, though, as these [1, 2] research articles demonstrate.
I was able to improve our results by limiting detected lines to those with approximately horizontal or vertical slopes, since those are the only ones we are interested in for newspapers, but it is clear we need to tweak the script or enlist a better system.
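The slope filter itself doesn’t need anything from OpenCV. Here is a sketch, assuming each detected line arrives as an (x1, y1, x2, y2) endpoint tuple, the format cv2.HoughLinesP produces:

```python
import math

# Keep only lines whose angle is within `tol` degrees of horizontal
# or vertical, the only orientations that column rules and article
# dividers in a newspaper should have.

def nearly_axis_aligned(line, tol=2.0):
    x1, y1, x2, y2 = line
    angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1))) % 180
    return min(angle, abs(angle - 90), abs(angle - 180)) <= tol

# A near-horizontal rule, a vertical column divider, and a 45° stray:
lines = [(0, 10, 500, 12), (30, 0, 30, 400), (0, 0, 100, 100)]
print([nearly_axis_aligned(l) for l in lines])  # → [True, True, False]
```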
Before class this Wednesday, your homework assignment is to publish a post to this course blog reporting on what you, individually, have been working on and thinking about in preparation for your small group presentations next week. Be sure to include not only what you’ve done, but what you’re planning to do next (i.e., in class on Wednesday).
Your post should be informal, but substantive, and should take advantage of the blogging medium. For example, if an image or screenshot would help to illustrate your progress, you should embed an image in your post. If you are referring to other websites (for example, the tutorials you are using, or examples of other sites), then the reader would probably appreciate a hyperlink. Providing these relevant enrichments to your text will improve your homework score; a post that provides nothing but text will have a maximum score of 7, whereas posts that provide relevant media and links can achieve the maximum score of 10.
You should also think about your audience for this post, which is potentially Internet-wide. Although you are writing to students in our class, your experience may also be valuable for other would-be digital historians wanting to know how you used these tutorials and what difficulties or successes you are encountering. So aim your post at a potential audience that could include students in the UNT course and other history students or historians interested in using digital tools like these.
For this homework assignment, you should write a comment on this blog post that responds to the readings and discussions that we have been doing in class, including the readings for Monday.
If you prefer, you may download this assignment in PDF form.
For our January 31 class, you read several articles about using a method called "topic modeling" to "read" texts algorithmically. In this homework assignment, you will have a chance to use MALLET, a topic modeling software package, yourself and then write a reflection on your experience that applies what you have learned to our class project.
Before You Begin
This assignment will require you to use the command line on your computer. I recommend that before you begin, you review some of the material on this that we covered in class on Friday.
If you have a Mac or Linux machine, the Command Line Bootcamp from the Scholars’ Lab at the University of Virginia is a useful place to begin, and it is aimed at humanities students and scholars. If you have a Windows machine, here is a basic introduction to the DOS prompt.
Regardless of your machine, there are three main things you will need to be able to do in this assignment from the command line, so make sure you understand how to do each of them:
- See what directory you are currently in.
- Change directories.
- List the contents of the current directory.
- See inside the contents of a file.
You may also want to know how to clear your terminal screen if it becomes too crowded with text. You can do this with the command cls at the Windows command prompt and the command clear at the Unix/Mac command line. (Even after clearing the screen, you should be able to scroll up in your terminal window to see what you’ve done in the past.)
- To gain a basic familiarity with the command line.
- To install and use MALLET with the sample data included in the package.
- To reflect on the uses and limitations of topic modeling in historical research.
- To gain experience and confidence in following a detailed tutorial for an unfamiliar tool.
There are both technical and non-technical requirements for this assignment, but the two parts are separable. I recommend that you attempt the technical part first since it will probably take longer, but if you get stuck, you should be able to answer the questions in the non-technical part before completing the techy stuff.
Complete the tutorial on Getting Started with Topic Modeling and MALLET at the Programming Historian, which will show you how to install MALLET and then use it on the sample documents included with the package.
This requirement will be completed when you tweet two screenshots of your work to the course hashtag #ricedh. More specifically:
- One screenshot should, like Figure 8 in the tutorial, show the output of a train-topics command on the sample data set discussed in the tutorial, but should show that you generated 15 topics instead of the default 10.
- One screenshot should, like Figure 10 in the tutorial, show the tutorial_composition.txt file generated by your 15-topic model opened in Excel. (If you don’t have Excel installed on your computer, you can also satisfy this requirement by creating a GitHub Gist containing the contents of your tutorial_composition.txt file and tweeting the link to the Gist instead.)
If you are not familiar with how to take screenshots on your computer, do some Googling to find out the answer, or ask on Twitter for help. You will also need to learn how to post photos on Twitter.
After reading the Friday texts about topic modeling and trying out MALLET yourself, you should be able to figure out answers to the following two questions:
- Suppose we wanted to create a topic model of the runaway slave ads we have collected on our Google Spreadsheet. What first steps would we have to take to get from our spreadsheet of permalinks to a *.mallet file that we could train topics on?
- In his Mining the Dispatch project, Robert K. Nelson used MALLET to find articles that were likely to be fugitive slave ads in a large corpus of digitized newspapers. What feature(s) of the Portal to Texas History would have prevented us from using the same method to discover ads in the Telegraph and Texas Register? Be as specific and thorough as possible. (Here’s a hint: do some searching for keywords in the Telegraph and Texas Register on the Portal, and notice what kinds of results you get back. Does the kind of result returned by a keyword search tell you something about the way that the underlying text documents in the Portal are stored and separated from each other?)
Write up an email to me answering both of these questions. You should be able to answer them with just a few sentences in each case—no more than two good-sized paragraphs should do the job.
Summary and Evaluation
Successful completion of this assignment will include:
- Two screenshots posted to Twitter to satisfy the technical requirements.
- An email to me answering the two non-technical questions.
Because this assignment has several separable parts, I will divide up the points this way when evaluating your homework: two points for each screenshot, and three points for each answer in the email.
Help! I’m Stuck!
There is a good possibility you’ll encounter technical difficulties when doing this assignment. Don’t fret or bang your head against the wall all weekend if you are getting an error message that is not mentioned in the tutorial, or if you are having trouble getting the same results shown in the tutorial. Instead, get help!
You can always take to Twitter if you need help. If you are getting error messages in your terminal that are longer than 140 characters or difficult to explain, you can also use a Gist, as you did in the first homework, to get help. Copy and paste the strange output of your terminal into a Gist, putting an explanation of what produced it in the Gist "description," and then tweet the URL of that Gist to our course hashtag to see if I or another student can help. (And remember, helping out other students is a way to score well on the Team Participation part of your grade.)
Remember, though, the academic integrity policies for the course. Do not get someone else to do the work for you and be sure to acknowledge any pointers or technical assistance you received—in this case by noting it in your email to me.
If you prefer, you can download these instructions in PDF form.
Runaway slave advertisements from nineteenth-century Texas appeared in newspapers that have been digitized. That is, like all digital representations of analog sources, they have been only partially digitized. The Portal to Texas History at the University of North Texas contains full-page images of many nineteenth-century newspapers, together with metadata about the newspapers themselves and OCR text for each newspaper page that makes it possible to search for text.
But these newspapers have not been digitized so as to provide metadata or descriptions at the level of individual articles. That means, to paraphrase Daniel J. Cohen and Roy Rosenzweig, part of the information visible to the eye (i.e., information about when a new article or ad begins and ends) has been lost (or at least not digitized) in the process of the newspaper’s "becoming digital."
This presents a problem for researchers, like us, who are interested in a particular kind of article—runaway slave advertisements. In this homework assignment, you will engage in the practice of digitization by looking through page images from one year of the Telegraph and Texas Register, identifying advertisements pertaining to runaway slaves, and inputting some basic metadata about the ad into the collaborative spreadsheet that you used in Homework #1. In the process you will also learn to pay attention to the "interface" of a search database and gather new information about how acts of resistance or flight by enslaved people were represented in primary sources.
- To gain familiarity with how one major digitization project has decided to produce and share digital objects.
- To generate new questions about the kinds of information contained in runaway slave advertisements and how they changed over time.
- To help complete a database of runaway slave advertisements found in one of Texas’s longest-running nineteenth-century newspapers.
Before You Begin
Spend some time browsing the site and looking through the help guides for the site, particularly the one on using newspapers. Think about how the site is organized and what sorts of searching and browsing are possible (or not possible) with the user interface provided. Run a few searches about something that interests you, and click through on the results. Get a feel for how the site "works," spending at least 10 minutes on this before proceeding.
Now head over to this lesson from the Programming Historian website about downloading records. You’re not actually going to be "programming" for this assignment or doing any downloading; you should only read the first three major sections of this lesson: "Applying Our Historical Knowledge," "The Advanced Search on OBO," and "Understanding URL Queries." These sections give you a tour through the Old Bailey Online, whose search query interface is broadly similar to the Portal to Texas History. Pay particular attention to what the lesson shows you about "query strings." Then go back to the Portal and run some more searches, noting how the values in the URL query strings change as you navigate through the site or run different searches.
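Query strings like the ones the lesson describes can be taken apart and rebuilt with Python’s standard library. Here is a sketch using an invented URL; the domain and parameter names are made up for illustration and are not the Portal’s actual query-string vocabulary:

```python
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

# A made-up search URL standing in for a real query string.
url = "https://example.org/search/?q=runaway+negro&sort=date_a&page=1"

parts = urlsplit(url)
query = parse_qs(parts.query)
print(query["q"])  # → ['runaway negro']

# Editing a value and rebuilding the URL shows why a tweaked query
# string works like a "saved search" you can revisit or share.
query["page"] = ["2"]
new_url = urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
print(new_url)
```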
Now you are ready to proceed to the homework assignment.
Step 1: Find Newspaper Issues
Each of you will receive an email from me assigning you a year (or the equivalent number of issues) from the Telegraph and Texas Register that you will be responsible for reading in search of runaway slave advertisements.
Your first task is to figure out how to perform a search (or modify a search URL) so as to pull up (in "date ascending" order) all of the issues from the newspaper in your time period.
Here’s an example of what such a "Search Results" page looks like for the 1843 volume of the Register:
Once you have a page that you believe shows the first page of all the results from a search for issues of the newspaper in your assigned timeframe, tweet that URL directly to me @wcaleb with the course hashtag so that I can check the URL and make sure you have found all the relevant issues. I have to approve this URL by a reply tweet before you can continue.
Step 2: Find Runaway Ads
Now you will be ready to go into each issue and look for ads. You’ll click through to the "Read this Newspaper" tab of each issue, and then click on "Zoom (Full Page)" so that you can magnify the image. Use the arrow buttons at the top to flip through the various pages (or sequences) of the issue. Even though the ads are most likely to appear on pages 3 and 4, make sure that you at least run your eyes over every inch of every issue.
This will take time, so start early. I recommend that you time how long it takes you to get through one or two issues following the steps below, so that you can plan your schedule accordingly.
The ads will come in different formats, and may have very different amounts of information. Some of the ads will be posted by subscribers who are seeking to find a slave who has run away. Other ads will be posted by sheriffs or others who have captured a slave and are seeking the legal owner. If it looks like a runaway slave ad to you, or just looks like it has to do with runaway ads (e.g., a notice from the newspaper about how to submit a runaway ad, or an item about the different graphics used in ads), you should go ahead and enter it into the spreadsheet. Right now we just want to identify items of interest, so better to cast a wide net than a narrow one!
Once you’ve found an ad, you’ll need to enter it on the Google spreadsheet of runaway ads already collected. Be sure to carefully follow these instructions when you enter:
- Before you start entering, find the appropriate "sheet" by looking at the tabs at the bottom of the window. Each year has its own "sheet" or tab, so find the one that belongs to your year.
- As shown in the labels at the top of the sheet, you should list the year, month (as a number) and day (as a number) of the issue on each row.
- Also include the full citation (which you can copy and paste from the top of the zoomed page at the Portal to Texas History):
- To generate a "permalink" URL that can be copied into the permalink column, first use the "Zoom" feature to enlarge the ad and center it in your browser’s viewing window. Make it as large as you can while still keeping all of the ad within view. Then click on the "Permalink" button while zoomed in on the ad.
- If you think you recognize the ad as one you have seen before in a previous issue, identify it as a reprint by placing an asterisk in the final labeled column. You can use the other blank columns to the right to make any helpful notes to yourself (for example, by noting the names as a way of helping you to remember which might be reprints).
Finally, if you go through an entire issue of the paper and find no runaway ads, make a single row that indicates the date of the issue, and then type "None" in the "Full Cite" column.
Step 3: Reflect on Findings
After you have finished looking through all of your assigned issues, return to the JSON gist that you submitted for your first assignment and notice what pieces of information you found significant about the three ads you looked at then. In the process of going through your year of newspapers, did you notice new kinds of information that you had not seen before? Are there name/value pairs you would add to your JSON if you were doing it again? Was there anything about the newspapers (either in the ads or in the surrounding material) that surprised or interested you?
For the final step in this assignment, leave a comment on this blog post answering at least one of the questions above. You may use a non-identifying pseudonym as you make your comment, so long as you let me know which pseudonym you used.
To recap, successful completion of this homework requires:
- A tweet to me, with the course hashtag and the URL to the first page of search results from Portal to Texas History containing all the available issues of the Telegraph and Texas Register in your assigned time period.
- A completed tab in the Google spreadsheet that documents all of the ads contained in the paper in the time you were assigned.
- A brief comment reflecting on what you saw in the newspaper.
Points will be deducted if the above technical requirements are not met, if the work contains numerous typographical errors, or if your blog comment does not seriously engage with the questions asked and reflect a thoughtful encounter with the newspapers you saw.
As in the first homework assignment, you can always take to Twitter if you need help, but in keeping with the academic integrity policies for the course, do not get someone else to do the work for you and be sure to acknowledge any pointers or technical assistance you received—in this case by noting it in your blog post comment.
If you prefer, you can download this homework assignment as a PDF file.
What information from a runaway slave ad would need to be captured in digital form in order for the information to be useful to a historian? How can that information be stored in a way that makes it easier for a computer to understand it?
In this assignment you will use three runaway slave advertisements from a nineteenth-century Texas newspaper to come up with a simple schema for digitizing the data in the ads. After deciding on the elements in your schema and their data types, you will put the information into valid JSON, a simple data formatting language that is easy for both humans and machines to read.
Note: This instruction sheet is long and detailed in order to make the assignment as clear as possible, but don’t be daunted by the length! If you work through this step-by-step, you will be able to complete the assignment for full credit, and help will be available!
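To give a taste of the end product, here is one hypothetical record rendered as JSON from Python. Every field name and value below is invented for illustration; the whole point of the assignment is that your own schema will differ:

```python
import json

# A hypothetical record for a single ad. All fields are invented
# examples, not a prescribed schema.
ad = {
    "newspaper": "Telegraph and Texas Register",
    "date": "1843-05-10",
    "poster": "subscriber",   # vs. e.g. "sheriff" for captured runaways
    "reward_dollars": 25,
    "is_reprint": False,
}

print(json.dumps(ad, indent=2))
```

Notice how each name/value pair captures one discrete piece of information from the ad, which is what makes the record easy for a computer to sort, filter, and count.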