Author Archives: fdb1

Parsing Newspaper Images

We are trying to parse newspaper images into smaller image components, each containing a single article, which (unsurprisingly) is proving more difficult than we imagined. Our first approach used OpenCV to detect the lines printed on the page and split articles along them, but line detection with the Hough transform performs very poorly on our input images. For now we are switching to finding the runaway slave icon in the text, using image recognition (Haar cascade detection) in OpenCV. We have not given up on parsing pages by article, though. We are now considering segmenting by image variation: distinguishing text from whitespace by pixel values, then mapping the text lines to find changes in text style that correspond to the end of one article and the beginning of another.
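To make the pixel-value idea concrete, here is a minimal sketch of one way it could work, not our actual code. It counts dark pixels in each row to find text lines, then flags lines much taller than the median as possible headlines, a crude stand-in for a "change in text style". The function names, the ink threshold of 128, and the 1.5 height ratio are all assumptions chosen for illustration, and the page is a synthetic array rather than a real scan.

```python
import numpy as np


def find_text_lines(gray, ink_threshold=128, min_ink_pixels=1):
    """Return (start_row, end_row) pairs, end exclusive, one per run of rows with ink.

    `gray` is a 2-D uint8 array where dark pixels are text and light pixels
    are paper. Both threshold parameters are illustrative guesses.
    """
    ink = gray < ink_threshold                 # True where a pixel is dark
    is_text = ink.sum(axis=1) >= min_ink_pixels
    # Pad with False so runs touching the top or bottom edge are closed.
    padded = np.concatenate(([False], is_text, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]
    return [(int(s), int(e)) for s, e in zip(starts, ends)]


def candidate_breaks(lines, height_ratio=1.5):
    """Return indices of lines much taller than the median line height.

    A tall line is treated as a possible headline, i.e. a possible start
    of a new article. This is a heuristic, not a reliable detector.
    """
    heights = np.array([end - start for start, end in lines])
    if heights.size == 0:
        return []
    median = np.median(heights)
    return [i for i, h in enumerate(heights) if h > height_ratio * median]


# Synthetic "page": white background with four dark bands standing in for text.
page = np.full((60, 40), 255, dtype=np.uint8)
page[5:9, 2:38] = 0      # body line
page[12:16, 2:38] = 0    # body line
page[22:34, 2:38] = 0    # taller band, standing in for a headline
page[38:42, 2:38] = 0    # body line

lines = find_text_lines(page)
breaks = candidate_breaks(lines)
print(lines)
print(breaks)
```

On a real scan, the array could be loaded with `cv2.imread(path, cv2.IMREAD_GRAYSCALE)`. Noise, skew, and multi-column layouts would all need handling before row counts like these become meaningful, for example by running the same profile column-wise first to separate the columns.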

– Franco Bettati, Aaron Braunstein