Found in Translation
Words matter. But for all the technology out there, today’s translation systems support only about 100 of the world’s 7,000 languages. So-called “low-resource” languages may be spoken by millions of people, but are not prevalent in written texts. This creates a challenge for translation systems that typically “learn” from seeing millions of written examples.
A team of researchers and students from the Natural Language Lab at USC Viterbi’s Information Sciences Institute (ISI) is tackling this challenge by developing machine learning tools to quickly translate important information in any of the world’s languages.
The exercise isn’t merely academic. In the wake of a humanitarian crisis, for example, understanding these languages is vital so that information doesn’t get lost in translation.
“Let’s say there’s an earthquake in Armenia. The language spoken in that area is probably not covered with current technology,” said Kevin Knight, an ISI research director and Dean’s Professor of Computer Science, and an expert in machine translation and cryptology.
“We want to be able to look at messages coming from the region and say, ‘These ones are describing the earthquake, these ones are asking for food and water,’” he explained. “That way, the aid organization knows, for example, what food and supplies to put on the trucks and where to send them.”
Since 2015, Knight and his team have been working on a four-year Defense Advanced Research Projects Agency (DARPA) project called LORELEI, which stands for Low-Resource Languages for Emergent Incidents. The goal is to create a rapid, automated translation toolkit for unknown languages.
When little written data exists in a particular language, the decipherment process relies on pairing known with unknown information to assemble the linguistic puzzle. For Knight and his team, tactics include a program that transforms any language’s writing system into the Latin-based alphabet; a name-finding tool that highlights names of people, places and organizations; and the use of a closely related known language as a stepping stone to define unknown words.