Tom Simonite - Wednesday, June 6, 2012, Technology Review (published by MIT)
Languages that aren’t used online risk being left behind. New translation technology from Google and Microsoft could help them catch up.
Sometimes you may feel like there’s nothing worth reading on the Web, but at least there’s plenty of material you can read and understand. Millions of people around the world, in contrast, speak languages that are still barely represented online, despite widespread Internet access and improving translation technology.
Web giants Microsoft and Google are trying to change that with new translation technology aimed at languages that are being left behind—or perhaps even being actively killed off—by the Web. Although both companies have worked on translation technology for years, they have, until now, focused on such major languages of international trade as English, Spanish, and Chinese.
Microsoft and Google’s existing translation tools, which are free, are a triumph of big data. Instead of learning as a human translator would, by studying the rules of different languages, a translation tool’s algorithms learn how to translate one language into another by statistically comparing thousands or millions of online documents that have been translated by humans.
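The statistical idea described above can be illustrated with a toy sketch. It is not the method used by either company; the function name `learn_word_translations`, the four-sentence English-Spanish corpus, and the choice of a simple co-occurrence score (the Dice coefficient) are all assumptions made for illustration. The sketch counts how often each word pair appears together in human-translated sentence pairs and keeps the best-scoring match for each source word.

```python
from collections import Counter
from itertools import product


def learn_word_translations(sentence_pairs):
    """Guess a translation for each source word from co-occurrence counts.

    Hypothetical helper for illustration only: real systems use far
    richer models and vastly more data than this.
    """
    src_count = Counter()   # number of sentence pairs containing each source word
    tgt_count = Counter()   # number of sentence pairs containing each target word
    pair_count = Counter()  # number of sentence pairs containing both words

    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))

    best = {}
    for (s, t), both in pair_count.items():
        # Dice coefficient: 1.0 means the two words always appear together.
        score = 2 * both / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}


# Tiny made-up parallel corpus (an assumption, not data from the article).
corpus = [
    ("the house", "la casa"),
    ("the car", "el coche"),
    ("a house", "una casa"),
    ("a red car", "un coche rojo"),
]
table = learn_word_translations(corpus)
print(table["house"], table["car"])
```

With only four sentence pairs, "house" and "car" are matched reliably, but words such as "the" or "red" tie between several candidates. Resolving such ties is one reason this approach depends on very large collections of translated text.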
The two companies have both departed from that formula slightly to serve less popular languages. Google was recently able to launch experimental “alpha” support for a collection of five Indian languages (Bengali, Gujarati, Kannada, Tamil, and Telugu) by giving its software some direct lessons in grammar, while Microsoft has released a service that allows a community to build a translation system for its own language by supplying its own source material.
Google first realized it needed to give its system a grammar lesson when trying to polish its Japanese translations, says Ashish Venugopal, a research scientist working on Google’s translation software. “We were producing sentences with the verb in the middle, but in Japanese, it needs to go at the end,” Venugopal says. The problem stemmed from the system being largely blind to grammar. The fix that the Google team came up with—adding some understanding of grammar—enabled the launch of the five Indic languages, all used by millions on the subcontinent but largely missing from the Web.
Google’s system was trained in grammar by giving it a large collection of sentences in which the grammatical parts had been labeled—more instruction than Google’s translation algorithms typically receive.
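The verb-placement problem Venugopal describes can be pictured with a deliberately simple sketch. It assumes sentences arrive with each word already labeled by its grammatical part, as in the training data described above; the tag names, the function `move_verbs_to_end`, and the single hand-written rule are hypothetical and are not Google's approach, which learns from labeled examples rather than applying a fixed rule.

```python
def move_verbs_to_end(tagged_tokens):
    """Reorder one labeled clause so that its verbs come last.

    `tagged_tokens` is a list of (word, tag) pairs. The "VERB" tag is an
    assumed label for illustration; relative order is otherwise preserved.
    """
    non_verbs = [word for word, tag in tagged_tokens if tag != "VERB"]
    verbs = [word for word, tag in tagged_tokens if tag == "VERB"]
    return non_verbs + verbs


# An English clause with subject-verb-object order, labeled by hand.
labeled = [("I", "PRON"), ("read", "VERB"), ("the", "DET"), ("book", "NOUN")]
reordered = move_verbs_to_end(labeled)
print(reordered)  # ['I', 'the', 'book', 'read']
```

Without the labels, a system that only compares word statistics has no direct way to know which word is the verb, which is why it can leave the verb stranded in the middle of the sentence.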
Venugopal says that, so far, the system can’t handle the underserved languages as well as Google’s existing translation technology can handle more established languages, such as French and German. But, he says, offering any support at all is important for languages that are relatively rare online. “It’s an important part of our mission to make those other languages available on the Web,” he says. “We don’t want people to have to decide whether to publish their blog in their own language or in English. We want to help the world read your blog.”
Microsoft is also interested in helping languages not in common use online, to prevent those languages from being sidelined and falling from use, says Kristin Tolle, a director at Microsoft Research. Her team recently launched a website, called Translation Hub, that helps anyone create their own translation software. It is intended for communities that wish to ensure their language is used online.
Using Translation Hub involves creating an account and then uploading source materials in the two languages to be translated. Microsoft’s machine-learning algorithms learn from that material and can then attempt to translate any text written in the new language. Microsoft piloted the technology in collaboration with leaders of the large Hmong community in Fresno, California, for whose language no machine translation system exists.
“Allowing anyone to create their own translation model can help communities save their languages,” says Tolle. Machine translation systems have been developed for only about 100 of the world’s 7,000 languages, she adds.
“There is a lot of truth to what Microsoft is saying,” says Greg Anderson, director of the nonprofit Living Tongues, which documents, researches, and tries to support disappearing languages. “Today’s playing field involves a digital online presence whether you are a community or a company. If you don’t have a Web presence, you don’t exist, on some level.” Anderson says that sidelined languages making a comeback are usually those from communities that have embraced online life using their language.
Margaret Noori, a lecturer at the University of Michigan who works to preserve Anishinaabemowin, or Ojibwe, a Native American language, agrees, but adds that preserving a language involves more than the Web. “There is a reason to be online in today’s world, but it absolutely must be balanced by songs sung only aloud and ceremonies never recorded.”
Microsoft’s Translation Hub is also aimed at enabling the translation of specialist technical terms or jargon, which general-purpose online translation tools do not handle well. Nonprofits could, for example, use it to translate materials on agricultural techniques, says Tolle, and the technology can also be useful to companies that wish to speed up translation of instruction manuals or other material.
“Companies often want to have their data available to them privately and retain their data—not to provide it to someone else that will train a translation system,” she says. Volvo and Mercedes have expressed an interest in testing Microsoft’s Translation Hub, says Tolle.