Assaf Urieli: Computational Yiddish Linguist
Assaf Urieli didn’t set out to revolutionize Yiddish literary scholarship. He just wanted a better way to study his roots.
A South African–born, Israeli-raised, American-trained software engineer who lives with his wife and two sons in the French Pyrenees, Assaf, age forty, began several years ago to read everything he could find about his maternal grandmother’s hometown of Shavl, Lithuania. As his research continued, he soon found himself wishing for a faster way to gather information.
“At one point I downloaded a scanned copy of the Shavl yizkor book, Azoy zaynen mir geshtorbn [This is How We Died]. I was thinking to myself how much simpler it would be if I could search the entire book electronically to find references to certain family names. And then I thought, why not?”
Why not train a computer to read Yiddish?
As a computer scientist with a specialty in computational linguistics, Assaf was familiar with a technology called optical character recognition. An application that enables computers to look at a page of printed text, identify individual letters, and determine the words they represent, OCR has existed for English and other languages for some time. But Yiddish OCR has proven far more elusive. One longstanding project finally reached 80 percent accuracy, which sounds pretty good until you realize it means that every fifth letter, or on average one letter in every word, is wrong. Far more successful is a Yiddish OCR program developed by Refoyl Finkel, a professor of computer science at the University of Kentucky. But Dr. Finkel’s program needs to be specifically adapted for each new title. Assaf, on the other hand, dreamed of a program that would work across all Yiddish titles and, moreover, would employ artificial intelligence to learn from its mistakes, thereby increasing its accuracy with every book.
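The arithmetic behind that 80 percent figure can be checked in a few lines. The sketch below assumes an average word length of five letters and errors that fall independently on each letter; both are simplifying assumptions for illustration, not figures from any actual OCR project, and the function names are made up for the example.

```python
def expected_errors_per_word(letter_accuracy, word_length=5):
    """Average number of wrong letters in a word of the given length."""
    return (1.0 - letter_accuracy) * word_length


def chance_word_is_clean(letter_accuracy, word_length=5):
    """Probability that every letter in the word is read correctly,
    assuming (hypothetically) that errors on each letter are independent."""
    return letter_accuracy ** word_length


if __name__ == "__main__":
    # Compare a few per-letter accuracy levels side by side.
    for accuracy in (0.80, 0.90, 0.99):
        print(
            accuracy,
            expected_errors_per_word(accuracy),
            chance_word_is_clean(accuracy),
        )
```

Under these assumptions, 80 percent per-letter accuracy works out to one expected error per five-letter word, and only about a third of such words come through with no error at all.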
The reason no one has accomplished such a feat until now is the sheer complexity of the task. There are almost 40,000 separate Yiddish titles. Published in different countries, they vary widely in appearance, with different layouts, different fonts, broken type, imperfect printing, and a surprisingly wide range of spelling conventions. All these factors conspire to make modern Yiddish a difficult fit for computerized textual interpretation. Consistency may be the hobgoblin of little minds, but it is almost a prerequisite for OCR.
Fortunately, Assaf is not one to be easily daunted. He first started tinkering with computers when he was ten. He speaks five languages. He has lived on four continents. He published a book of modern riddles under the pen name Moyshele Rosencrantz. He once recorded a CD of eighteen songs by the great French singer-songwriter Georges Brassens. For his day job he runs a successful software company whose clients include the French space agency. If anyone had the technical and linguistic skills to develop a comprehensive, fully functional OCR application for Yiddish, it was Assaf. And, of course, he had a potent personal motivation: the desire to better understand his own family’s history.
“I have always been fascinated by genealogy,” Assaf says. “The funny thing about my family is that if you look at five generations of us, we speak five different languages. My great-grandmother grew up in Lithuania speaking Yiddish. My grandmother’s first language was Russian, my mother’s was Hebrew, mine is English, and my sons’ is French. I always thought that to understand the people who came before, you had to know the language, the culture of their time and place. Well, a hundred years ago my ancestors were all speaking Yiddish. This was their language.”
In 2009, at a lab at the University of Toulouse (where he is pursuing a PhD in computer science), Assaf started writing a program he called Jochre—an acronym for Java Optical Character Recognition. He prepared a “corpus” of Yiddish texts and annotated it by hand, a technique that would allow the computer to recognize patterns within the annotated text and eventually learn how to annotate a new text on its own. Meanwhile he fed the program word lists so that it could begin to differentiate between actual words and random arrangements of letters.
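One way a word list can help a recognizer tell real words from random letter strings is sketched below. This is an illustration only, not Jochre's actual code: the function name, the scores, the tiny lexicon, and the bonus factor are all hypothetical. The idea is that when the letter recognizer proposes several readings of the same word image, a reading found in the word list gets its score boosted.

```python
def pick_reading(candidates, lexicon, lexicon_bonus=2.0):
    """Choose among (text, score) candidates, favoring known words.

    candidates: list of (text, score) pairs from a letter recognizer
    lexicon: set of known words
    lexicon_bonus: hypothetical multiplier applied to readings in the lexicon
    """
    def adjusted(candidate):
        text, score = candidate
        return score * lexicon_bonus if text in lexicon else score

    best_text, _ = max(candidates, key=adjusted)
    return best_text


# A toy word list of three common Yiddish words (illustrative only).
LEXICON = {"דער", "און", "איז"}

# Two readings of the same word image that differ only in the first letter:
# the recognizer slightly prefers the non-word, but the word list overrules it.
readings = [("רער", 0.55), ("דער", 0.45)]

if __name__ == "__main__":
    print(pick_reading(readings, LEXICON))
```

With an empty word list the function simply returns the recognizer's top-scoring reading, which is why the quality and coverage of the word list matter.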
“I was naively hoping to finish writing Jochre over the summer,” Assaf says. “Wishful thinking. It turned out to be a bigger project than I thought.”
The first time Assaf directed the machine to interpret the training corpus, it made many mistakes. It couldn’t always tell the difference, for instance, between a daled and a reysh, two letters that look quite similar in the Hebrew alphabet. Assaf corrected the analysis and added the corrected texts to Jochre’s training corpus. Then he directed the program to run another interpretation. With each pass the computer’s accuracy improved.
By the spring of 2011 Assaf had managed to double the training corpus from thirty to sixty pages. When Jochre finally achieved an accuracy of 97 percent, Assaf knew it was time to share his achievement. He sent an email to the Yiddish Book Center to see if we might be interested in collaboration.
“It was a ziveg min hashomayim, a match made in heaven,” says Aaron Lansky, noting that OCR has been a dream of the Yiddish Book Center ever since it started digitizing Yiddish books fifteen years ago. “At the risk of mixing metaphors, OCR is the holy grail of Yiddish digitization: a way to do instant, Google-like searches of millions of pages of Yiddish literature. Research that might take a scholar ten years to complete could be done in ten seconds instead.”
Last year, Yiddish Book Center fellow Josh Price worked closely with Assaf to further expand Jochre’s training corpus. With each refinement, a fully functional, industrial-scale Yiddish OCR has come into sharper focus. Josh has since gone on to graduate studies in Jewish history, and a new fellow, Agnieszka Ilwicka, has taken over his responsibilities.
“I think little by little, within the year, we will have something that we are ready to show the world,” says Assaf, “to publicize and say, ‘Here it is. Yiddish literature is here, and it is searchable.’”