The Intelligence Advanced Research Projects Activity presented researchers from the USC Viterbi School of Engineering’s Information Sciences Institute with a $16.7 million grant to develop a tool that will more efficiently translate low-resource languages. The project, titled SARAL, which stands for Summarization and domain-Adaptive Retrieval, will focus on creating systems that provide automated translations and summaries of documents in those languages.
Leading the team are principal investigator and ISI research team leader Scott Miller, ISI computer scientist Jonathan May, ISI research lead Elizabeth Boschee and senior advisors Prem Natarajan and Kevin Knight. The research team includes about 30 researchers well-versed in machine translation, speech recognition, morphology, information retrieval, representation and summarization.
The systems are intended to retrieve foreign-language documents and summarize how each is relevant to a user's question.
“The overall objective is to provide a Google-like capability, except the queries are in English, but the retrieved documents are in a low-resource foreign language,” Miller said to USC News.
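As a rough illustration of the cross-lingual retrieval Miller describes, one simple approach is to map English query terms through a bilingual dictionary and match the translations against foreign-language documents. The sketch below is a toy example, not SARAL's actual method, and the dictionary entries are invented for illustration:

```python
# Toy cross-lingual retrieval: English query terms are looked up in a small
# English-to-Swahili dictionary, then matched against Swahili documents.
# The dictionary below is a made-up example, not project data.
EN_TO_SW = {"water": "maji", "school": "shule", "food": "chakula"}

def retrieve(query: str, documents: list[str]) -> list[str]:
    """Return the documents containing any translated query term."""
    terms = {EN_TO_SW.get(word.lower()) for word in query.split()} - {None}
    return [doc for doc in documents
            if any(term in doc.lower().split() for term in terms)]
```

Real systems must cope with far harder problems (inflected word forms, missing dictionary entries, ranking), which is where the limited training data becomes the central obstacle.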
The summaries will be 100-word responses to domain-specific questions.
“You can think of the summary as something like Cliffs Notes, but with the added feature that it is indexed to the precise part you want to write your essay about,” May said to USC News.
The systems being developed differ from those currently in use, which depend largely on wide ranges of written samples to become acquainted with a language. Though they are spoken by millions, low-resource languages do not have a large number of documents available. The systems will first be tested using Tagalog and Swahili, both low-resource languages selected by IARPA. As the project develops, more languages will be added to the translation systems.
“Since we don’t have a lot of written data in these languages, we have to do more with less,” May said to USC News. “Ideally, we would use about 300 million words to train a machine-translation system — and in this case, we have around 800,000 words. There are about 100,000 words per novel, so we have only eight novels’ worth of words to work from.”
Those working on the project will begin by gathering a variety of materials that have already been translated into English, including speech recordings, online text and video clips.
Afterward, they will create algorithms to analyze different language patterns such as sentence structure and morphology, the study of word formation.
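To give a flavor of what morphological analysis involves, here is a minimal, hypothetical sketch: stripping known suffixes to guess a word's stem. The suffix list is a toy English example invented for illustration; it is not the project's algorithm.

```python
# Toy morphological analysis: guess a word's stem by removing the longest
# matching suffix from a small, illustrative list of English endings.
KNOWN_SUFFIXES = ["ing", "ed", "s"]  # invented example data

def guess_stem(word: str) -> str:
    """Strip the longest known suffix, keeping at least a short stem."""
    for suffix in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

Even a crude analyzer like this hints at why morphology matters for low-resource languages: recognizing that several surface forms share one stem effectively multiplies the value of scarce training text.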
Other institutions are also working toward creating translation systems. IARPA's project MATERIAL (Machine Translation for English Retrieval of Information in Any Language) is currently being developed by Johns Hopkins University, Columbia University and Raytheon BBN Technologies.