Unsupervised Transcription of Historical Documents


November 6, 2013


Documents from the printing-press era are difficult for OCR systems to transcribe because they are extremely noisy. However, the noise originates from processes that are causally understood. For example, thickened glyphs are caused by over-inking, and vertical offset is caused by slop in a mechanical baseline. We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing-press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our approach gives state-of-the-art results on two datasets of historical document images.


Taylor Berg-Kirkpatrick

Taylor Berg-Kirkpatrick is a fifth-year graduate student in computer science at the University of California, Berkeley. He works with Professor Dan Klein on applying unsupervised learning techniques to natural language problems. Taylor also completed his undergraduate degree in mathematics and computer science at Berkeley, where he won the departmental Dorothea Klumpke Roberts Prize in mathematics. As a graduate student, Taylor has received both the Qualcomm Innovation Fellowship and the National Science Foundation Graduate Research Fellowship.