Jun 26, 2012

Statistical Machine Translation: Wave of the Future, Flop of the Century, or Something In Between?

written by Keith Blasing

As many of you will know, the newest wave of machine translation tools is based on a different principle than the old computer translators that returned some notoriously bad results (e.g. the urban legend about “Out of sight, out of mind” going into Russian as something like “Invisible maniac”: http://www.snopes.com/language/misxlate/machine.asp). The new principle uses an ever-increasing number of existing translations to populate an enormous database of corresponding words and phrases in different languages. In theory, this should produce increasingly accurate translations as more content is added and the software can find more accurate, or at least more common, matches for larger chunks of text. In practice, there are fundamental issues with this approach that will make machine translators far inferior to skilled human translators for the foreseeable future.

First, the “garbage in, garbage out” problem. Google Translate has no quality screening system for the translations it feeds into its database. If someone translates an article and accidentally renders “apple” as “banana,” the mistake gets reproduced by the machine translator. For example, as of this writing, Google Translate believes that the word “Tell-Tale” in E. A. Poe’s “Tell-Tale Heart” should be rendered in Russian as “Signaling Device.” Why? Because someone, somewhere translated “tell-tale” as “signaling device,” and there is no mechanism to filter that usage out of the results.

In an ideal world, the “garbage in, garbage out” problem would be temporary. Eventually someone will enter a few Russian versions of Poe’s “Tell-Tale Heart” into the Google Translate database, and it will be able to spit out the usual translation of the story’s title (“Serdtse-oblichitel”). But it will also retain the “signaling device” translation in its database. Which brings us to the next fundamental problem with the new machine translators: as the database grows, more of the burden of choosing among many possible translations falls on the user. Want to translate the word “support” into Russian? Google Translate currently gives you 18 verbs and 15 nouns to choose from. Unlike the “garbage in, garbage out” problem, which in theory will improve with time as more information enters the database, the “too much information to choose from” problem can only get worse with time. The software can calculate which option is most common, of course, but the most common option is not always the correct choice. For that, the user needs to know the language.
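In programming terms, the selection mechanism described above boils down to a frequency count. The short Python sketch below uses a toy phrase table with made-up counts (not Google’s actual data or algorithm, which is far more sophisticated) to show how a purely statistical lookup picks the most common translation even when it is the wrong one for the context:

```python
from collections import Counter

# Hypothetical "phrase table": each English phrase maps to every translation
# observed in an imagined parallel corpus, mistakes and all.
phrase_table = {
    "tell-tale": Counter({
        "сигнальное устройство": 5,  # "signaling device" -- a technical usage
        "обличительный": 2,          # closer to the sense in Poe's title
    }),
}

def translate(phrase):
    """Return the most frequently observed translation, with no notion
    of which candidate is actually correct in a given context."""
    counts = phrase_table.get(phrase)
    if counts is None:
        return phrase  # no data: pass the phrase through untranslated
    return counts.most_common(1)[0][0]

print(translate("tell-tale"))  # "сигнальное устройство" -- most common, not most apt
```

The point is not the code but the design: frequency is the only signal, so a common technical usage outvotes a rarer literary one no matter what the surrounding text is about.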

There are other problems with Google Translate that are specific to particular languages. Ask it to translate “Solar Power” into Russian and it gives you a perfectly good translation of the words, but in the genitive case. That form is correct only in contexts that call for the genitive; as a standalone phrase, a Russian reader expects the nominative. If you have ever learned an inflected language, you know the importance of what I am talking about: the case of a word or phrase carries all the information about the role it plays in its context. If you do not know what I’m talking about, then you will not be able to use a “statistical machine translator” like Google Translate to produce an accurate translation to or from German, Russian, Ukrainian, Czech, or any other language with a system of grammatical cases. Again, you have to know the language to make the right choice.

The upshot? Current machine translators are often better than earlier ones because they have enough memory capacity to hold not only individual words, but combinations of words and phrases. This makes them fairly accurate if what you want is the most common translation of a two- or three-word phrase. But if you want reliable accuracy even in something as long as a sentence (let alone an important ten-page document), there is still no substitute for a human being with a thorough knowledge of both the target and the source language.
