While we on the Machine Translation team have seen steadily increasing traffic to our various offerings over the past few months, we noticed a sudden bump in traffic yesterday. Having grown up on Agatha Christie and Sherlock Holmes, I find such mysteries irresistible, and a number of other folks on the team were just as curious to find out what caused the bump. We determined that the IE8 Activity/Accelerator, the Messenger Bot, Search translations, and Office translations were all showing the same upward trend as in the days before, and thus were not the specific reason for the spike.
Eventually, we were able to identify one potential reason for the spike. Our user community found an oddity in how the machine translation engine translated several names from English to German. Given the current political atmosphere in the run-up to the US elections, it was to be expected that an engine translating the name of one party’s candidate into the name of someone from the other party would end up as news. While we certainly welcome all the new users who came by to check this phenomenon out, we wanted to share with our users the reason why such things seem to happen from time to time with statistically trained machine translation systems, both ours and others’.
A Statistical Machine Translation engine is trained on lots and lots of parallel data, that is, data that exists in both a source language (e.g., English) and a target language (e.g., German), where the source and target are translations of one another. Our engine is trained on millions of sentences for each language pair we support. In order to train on a particular corpus of data (say, a large number of newswire articles in English which have been translated into German), we first have to break that corpus down into sentences. Once the corpus has been split into sentences, we feed the resulting sentences into a sentence aligner, the sole purpose of which is to find which sentences on the source side align with which sentences on the target side. This is no trivial task, since a sentence on the source side could conceivably align with one or more sentences on the target side (or possibly none at all!). The aligner will sometimes make mistakes and misalign one sentence with another that is in fact not its translation. This can lead to mistranslations, especially if the source and target contain words that occur infrequently. Since our translation engine is statistical, it is highly reliant on co-occurrence frequencies between words in the source and target data. If certain words occur infrequently (people’s names, for instance, may occur only a few times across a corpus of millions of sentences), the lack of frequency can lead to mistranslations resulting from incorrect “guesses” between source and target (i.e., low probabilities assigned to particular source and target words). This can lead to some comical gaffes in our translation system.
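To make the effect concrete, here is a minimal sketch in Python. It is not our production engine: it scores word pairs with a simple Dice-style association measure over a tiny made-up corpus, whereas real systems use far richer alignment models. The names "smith" and "jones", the sentences, and the functions `train` and `best_guess` are all hypothetical and exist only for illustration. The last sentence pair is deliberately misaligned, and because "smith" occurs only once, that single bad pair is all the evidence the model has.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus of (English, German) sentence pairs.
# The last pair is deliberately misaligned: the English side mentions
# "smith", but the German side is the translation of a sentence about "jones".
PAIRS = [
    ("the house is new", "das haus ist neu"),
    ("the house is old", "das haus ist alt"),
    ("the car is new", "das auto ist neu"),
    ("the car is old", "das auto ist alt"),
    ("the team wins", "das team gewinnt"),
    ("the vote is close", "die wahl ist knapp"),
    ("smith wins the vote", "jones gewinnt die wahl"),  # misaligned pair
]


def train(pairs):
    """Count how often each word occurs, and how often each
    source/target word pair co-occurs in an aligned sentence pair."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return src_count, tgt_count, pair_count


def best_guess(word, model):
    """Return the target word with the highest Dice score for `word`,
    or None if the word never appeared in the training data."""
    src_count, tgt_count, pair_count = model
    scores = {
        t: 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
        if s == word
    }
    return max(scores, key=scores.get) if scores else None


model = train(PAIRS)
print(best_guess("house", model))  # haus  (seen twice, both times with "haus")
print(best_guess("smith", model))  # jones (seen once, in the misaligned pair)
```

Frequent words such as "house" come out right because every correct pairing reinforces the same association. The rare name has a single observation, so one alignment mistake is enough to decide its "translation".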
So, that is how the “machine” produced a translation that the community ended up attributing to our team’s sense of humor. While we continue to work hard to ensure proper alignments, a statistical system built on millions to billions of words can be expected to produce such a situation again.
The current issue with alignment should now be resolved, but we urge our community of users to keep helping us identify any such situations by contacting us through this blog.