This is the question I will have to answer over the next few weeks.
One of my classes this semester is the Advanced Machine Translation Seminar (and I hope that link works outside of CMU). Each of us registered for the class will present a topic in MT and then write a literature review on it by the end of the semester. Originally I had wanted to cover how word sense disambiguation (WSD) has been applied to statistical machine translation, but that overlapped with another topic on bringing context into MT. In simple terms, WSD is the task of figuring out which of a word's many senses applies in a given circumstance, and WSD systems use the context around the word to determine its sense. So WSD is really just another way of bringing context into MT. We couldn't find a clean way to separate the two topics, and since mine was the more specific, it seemed reasonable that I should be the one to change. No one else is presenting on machine translation evaluation (MT Eval), so I opted for that.
MT Eval is actually a pretty vibrant topic at the moment. For some quick background: machine translation systems produce woefully inadequate translations much of the time. If you have any doubt about this, try translating a random web page with any of the many free online services. You will get disfluencies, untranslated words, and downright gibberish. Not all of it will be bad, of course, but much of it will be. Translation is a hard problem, and many MT researchers believe it to be AI-complete (the Wikipedia article mentions MT explicitly). In order to improve machine translation, you need some way to automatically evaluate how well you are doing. Currently this is done with automatic metrics that compare machine output to (usually multiple) human translations, aka reference translations. The most commonly used metric is BLEU (pdf), but a rising star is METEOR, developed in part by one of my professors. I won't go into these metrics any further here, and I recommend interested parties check out the papers. What these metrics aim to do is gauge how similar the machine output is to the reference translation(s).
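To make the "compare machine output to references" idea concrete, here is a toy sketch of the core mechanism behind BLEU: clipped n-gram precision combined with a brevity penalty. This is a simplified illustration of the idea, not the official metric (real BLEU implementations add corpus-level aggregation, smoothing, and careful tokenization), and all function names here are my own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate, references, max_n=4):
    """Toy BLEU-style score: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. A sketch, not the real metric."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its max count in any reference,
        # so repeating a correct word cannot inflate the score.
        max_ref = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0  # real BLEU smooths this; the toy version just returns 0
    # Brevity penalty: penalize candidates shorter than the closest reference.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(toy_bleu(candidate, [reference], max_n=2))
```

An exact match against a reference scores 1.0, and any divergence pulls the score down, which is exactly why these metrics reward similarity to the references rather than "goodness" in any deeper sense.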
The problem with MT Eval is that to automatically tell whether something is a good translation, we would have to know exactly what goes into making a good translation in the first place (and by good I mean human-level). If we could do that, we would have solved MT!
More to come.