Google announced that it has abandoned Systran as its translation system for the 22 language pairs it serves besides Arabic, Chinese and Russian. Systran is one of the oldest machine translation companies around. When Microsoft launched its translation service recently, it announced that it would be supplementing its translations with Systran. Systran uses rule-based systems that have been massively tweaked, yet most would agree the results are still pretty crappy. They get some basic stuff right, but once you start venturing off into uncommon word usages and complex constructions, all bets are off. Some translation sites use Systran, while others, like freetranslation.com, use their own systems. Babel Fish is perhaps the best-known site still using Systran.
So Google is switching over to its own statistical machine translation system for all 25 language pairs. Statistical machine translation systems typically learn from two different kinds of text: aligned text in two languages (bitext) and monolingual text. The monolingual text is used to build a statistical model of the target language, so that the output follows the word order and phrasing of the target language rather than those of the original. For example, in German, the auxiliary verb comes in second position as in English, but the main verb often comes in final position. Reordering properly isn’t easy, and the language model helps make the output more natural. Bitexts are texts that have been translated from one language to another and then aligned, first sentence by sentence and then word by word. The sentence-level alignment may be done by hand, but the vast amount of human effort involved means that word-level alignment is usually done automatically. Getting good alignments is an ongoing area of research, and the results are still quite far from perfect.
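To make the two ingredients concrete, here is a toy sketch in Python. It is not how Google's system works; it just illustrates the general idea under some loud assumptions: the language model is a bigram model with add-one smoothing, the word aligner is a stripped-down IBM Model 1 (no NULL word), and the tiny German/English corpora and all function names (train_bigram_lm, log_prob, train_ibm1, align) are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# --- Monolingual text: a bigram language model (add-one smoothing) ---

def train_bigram_lm(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized text."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])            # every token that can be predicted
        context_counts.update(tokens[:-1])  # every token that can be a context
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocab)

def log_prob(sentence, lm):
    """Smoothed log-probability of a sentence under the bigram model."""
    context_counts, bigram_counts, vocab_size = lm
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigram_counts[(a, b)] + 1) / (context_counts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

# Hypothetical monolingual German data (verb-final word order).
german_text = [
    "ich habe das Buch gelesen",
    "ich habe das Haus gesehen",
    "er hat das Buch gekauft",
]
lm = train_bigram_lm(german_text)

# Two candidate outputs: English-like order vs. German verb-final order.
candidates = ["ich habe gelesen das Buch", "ich habe das Buch gelesen"]
best = max(candidates, key=lambda c: log_prob(c, lm))

# --- Bitext: word alignment with a simplified IBM Model 1 ---

def train_ibm1(bitext, iterations=10):
    """Estimate t[(target_word, source_word)] with EM. No NULL word (assumption)."""
    source_vocab = {s for source, _ in bitext for s in source}
    start = 1.0 / len(source_vocab)   # any constant start works; EM renormalizes
    t = defaultdict(lambda: start)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for source, target in bitext:
            for w in target:
                # E-step: split each target word's mass over the source words.
                norm = sum(t[(w, s)] for s in source)
                for s in source:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize expected counts per source word.
        for (w, s), c in count.items():
            t[(w, s)] = c / total[s]
    return t

def align(source, target, t):
    """Link each target word to its most probable source word."""
    return [(w, max(source, key=lambda s: t[(w, s)])) for w in target]

# Hypothetical sentence-aligned bitext (German source, English target).
bitext = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_ibm1(bitext)
alignment = align("das Haus".split(), "the house".split(), t)

print(best)
print(alignment)
```

On this toy data the language model prefers the verb-final candidate, and the aligner pairs "house" with "Haus" and "the" with "das" even though nobody told it which word goes with which. It works that out purely from co-occurrence across sentence pairs, which is also why more data helps so much.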
The thing that Google has going for it is that with statistical machine translation, the more data the better. And Google is overflowing with it. It’ll be interesting to see how their systems progress.



