After nearly a decade in development, the Yiddish Book Center has launched a new website that will allow users to search the full text of nearly 11,000 scanned Yiddish books. This revolutionary technology uses machine learning to process an image of a printed page and transform it into searchable text.
Optical Character Recognition (OCR), which converts images of printed pages into machine-readable text, has been available for English and other major world languages for many years. This technology makes the contents of books searchable and accessible, offering more fine-grained access than traditional cataloging records, tables of contents, or even indices.
For Yiddish and many less widely spoken languages, OCR is complicated to develop because the computer must be trained to convert page images into text correctly, regardless of differing font styles, spacing, page layouts, and imperfections in the page image. For Yiddish, it is further complicated by the relative age of most Yiddish books and the small number of Yiddish texts in existence (as compared to a language like English).
The Yiddish Book Center has partnered with Assaf Urieli, developer of JOCHRE, an OCR application that uses machine learning to improve its analysis over time. Assaf and the Yiddish Book Center have launched the site with the majority of titles in the Steven Spielberg Digital Yiddish Library available to search in full text. The site is currently in beta testing and is improving daily, as the engine is designed to train itself based on corrections submitted by human users.
By allowing users to search the actual text of nearly 11,000 Yiddish books, this technology will reduce searches that used to take years to a matter of seconds, revolutionizing research in Jewish history, literature, linguistics, ethnography, and genealogy.
To visit the full-text search site, go to https://ocr.yiddishbookcenter.org.
Read the Pakn Treger profile "Assaf Urieli: Computational Yiddish Linguist."