As of its latest edition, the Google Translate application supports more than 100 languages and serves more than 500 million users worldwide. But more than 6,000 languages are spoken around the world, about 360 of them by a million people or more.
To combat this disparity in translation technology, the USC Information Sciences Institute is working on a universal language translation system. Supported by the Defense Advanced Research Projects Agency, the program involves ISI Director of Natural Language Technologies Kevin Knight and a small team of full-time researchers and doctoral students.
Knight has worked with ISI in natural language processing for 25 years, and the system is one of many projects he’s working on in the field.
“That’s our goal — a universal language processor that works for any language,” Knight said. “We have a machine translation system [that] learns from examples to translate between languages. But we have a lot of data, for example, of all the human translations from the United Nations between English and French [to] train a system very well.”
However, a potential roadblock for the processor is less commonly spoken languages, for which little data exists.
“If you don’t have a lot of data, which is the case for most languages, you have to be more creative,” Knight said.
Knight described an African language called Oromo that the team is currently working on. Oromo has a very free spelling convention, which makes the language easier for humans to learn but makes it difficult for a machine to recognize two different spellings as the same word.
“Just like you can spell the word ‘gray’ in English as ‘g-r-a-y’ or ‘g-r-e-y’, in [Oromo], almost any word can be spelled in almost any way,” Knight said. “That’s great for a person — a person doesn’t care — but when a computer reads it, it’s like, ‘Oh that word’s different from that word; I don’t know how to deal with it.’ So, you kind of have to be robust.”
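The kind of robustness Knight describes can be illustrated with a minimal sketch. It is not the ISI team's method. It assumes a simple, hypothetical heuristic in which every run of vowels is treated as interchangeable and repeated letters are collapsed, so that variant spellings map to one shared key. The function names `spelling_key` and `group_variants` are invented for this illustration, and the example words are English because the article gives no Oromo examples.

```python
import re
from collections import defaultdict


def spelling_key(word: str) -> str:
    """Return a crude key shared by variant spellings (hypothetical heuristic)."""
    word = word.lower()
    # Treat any run of vowels as one interchangeable placeholder, "V".
    word = re.sub(r"[aeiou]+", "V", word)
    # Collapse repeated characters, e.g. "tt" becomes "t".
    word = re.sub(r"(.)\1+", r"\1", word)
    return word


def group_variants(words):
    """Group words whose keys match, so a system can treat them as one word."""
    groups = defaultdict(set)
    for w in words:
        groups[spelling_key(w)].add(w)
    return dict(groups)


if __name__ == "__main__":
    print(group_variants(["gray", "grey", "green"]))
```

Under this heuristic, “gray” and “grey” share a key while “green” does not. A key this coarse would also merge words that really are different, which is one reason a real system would need something more careful.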
Nima Pourdamghani, a doctoral student in computer science and a research assistant at ISI, explained the complications of creating a universal tool. Each language has an extensive vocabulary, Pourdamghani said, but the team is developing ways to make translation easier.
“Suppose we want to translate some language, but we have a related language from some nearby country,” Pourdamghani said. “We try to convert resources from that language into this language and get help from them.”
Pourdamghani said the machine can recognize Arabic and Latin as similar languages and can use data from one for the other.
“We were trying to have the machine both extract and clean dictionaries for all languages across the world,” Pourdamghani said.
However, the system is still not well equipped to handle prefixes, suffixes or plural forms. For example, it does not relate the words “cat” and “cats” because the system learns only from its training data, Pourdamghani said. And when the machine encounters a word for which it has no data, it cannot continue.
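One common way to soften both problems is a fallback lookup, sketched below under stated assumptions. This is an illustration, not the team's system. The `lookup` function, the `SUFFIXES` list and the one-entry English-to-Spanish dictionary are all hypothetical. If a word is missing, the sketch strips a plural suffix and retries. If that also fails, it passes the word through unchanged instead of stopping.

```python
# Hypothetical English plural suffixes, tried in order.
SUFFIXES = ("es", "s")


def lookup(word: str, dictionary: dict) -> tuple:
    """Return (translation, how_it_was_found) for a single word."""
    word = word.lower()
    if word in dictionary:
        return dictionary[word], "exact"
    # Fallback 1: strip a suffix and retry, so "cats" can reuse "cat".
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            if stem in dictionary:
                return dictionary[stem], "stem"
    # Fallback 2: no data at all, so pass the word through and keep going.
    return word, "unknown"


if __name__ == "__main__":
    toy_dictionary = {"cat": "gato"}  # illustrative entry only
    for w in ["cat", "cats", "dog"]:
        print(w, lookup(w, toy_dictionary))
```

The stem fallback returns the translation of the singular form, so the plural is lost. A fuller treatment would also have to produce the plural in the target language.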
The team plans to keep working on the system and smooth out its kinks. According to Knight, upcoming additions include helping the machine recognize different categories of words, such as diseases, holidays or personal titles.
The machine, once completed, will help international efforts like disaster relief, Knight said. When aid comes to countries that speak different languages, rescuers need to be able to communicate in various dialects to exchange information.
The translation project is a collaboration with other teams across the nation through DARPA, an agency of the U.S. Department of Defense. While USC focuses on the translation aspect of the project, other universities specialize in identifying the words that name people, places and organizations in any language, Knight said. Every year, a common evaluation allows the groups to share their innovations.
“It’s a big project that different teams in the country, even across the world, are collaborating on, and we’re sharing tools and sharing ideas together,” Pourdamghani said. “Hopefully in the next few years, we can have a better translation system for other languages.”