Since Papineni et al. (2002) introduced the BLEU metric for machine translation evaluation, string-matching metrics have dominated the field. These metrics work well enough, but there are cases where they break down, and a growing body of research is revealing their biases. BLEU also does not correlate especially well with human judgments, so MT research would benefit from a metric that better captures what makes a good translation.
A recent trend in this direction has been to introduce linguistic information into MT evaluation. Liu and Gildea (2005) used unlabeled dependency trees to extract headword chains from machine and reference translations to evaluate MT output. To define a few terms, reference translations are the human translations that machine translations are compared to during evaluation. In dependency grammars, a tree is built by linking each word to its head. So a determiner would be linked to the noun it modifies, the direct object would be linked to the verb, etc. Each link of this sort is a headword chain of length 2. As you follow the links further up the tree, you can construct longer and longer headword chains. Liu and Gildea extracted the headword chains for both the machine and the reference translation and produced a metric based on comparing the two sets of chains. These chains were not annotated with any sort of grammatical relation (subject, object, etc.), so they are unlabeled dependencies.
Owczarzak et al. (2007) extended the work of Liu and Gildea (2005) by using labeled dependencies. They parsed the pairs of sentences with the Lexical Functional Grammar (LFG) parser of Cahill et al. (2004). In LFG, every parse has two components: a c-structure (i.e. a parse tree) and an f-structure, which describes the features of the lexical items. An example of an LFG parse from their paper is given below. F-structures are recursive structures with a head containing all of its constituents. From the f-structure it is easy to construct dependency trees. The bonus is that the f-structure provides the grammatical relations between items in the dependency trees. In the example below, the dependency subj(resign, john) has the grammatical relation of subject. That is, John is the subject of the sentence headed by the verb resigned.
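As a rough illustration of how labeled dependencies fall out of an f-structure, the sketch below flattens a nested dictionary into (relation, head, value) triples. The dictionary layout, the feature names other than subj, and the function name are assumptions made for this example; they are not the parser's actual output format.

```python
def fstructure_to_triples(fs):
    """Flatten a nested f-structure (a dict with a "pred" key) into triples.

    Each triple is (relation, head, value), mirroring the subj(resign, john) notation.
    """
    triples = set()
    head = fs["pred"]
    for rel, value in fs.items():
        if rel == "pred":
            continue
        # A relation may hold one item or a list of items (e.g. several adjuncts).
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Embedded f-structure: link the two heads, then recurse.
                triples.add((rel, head, item["pred"]))
                triples |= fstructure_to_triples(item)
            else:
                # Atomic feature such as tense or number.
                triples.add((rel, head, item))
    return triples

# Hypothetical f-structure for "John resigned"
john_resigned = {
    "pred": "resign",
    "tense": "past",
    "subj": {"pred": "john", "num": "sg"},
}
triples = fstructure_to_triples(john_resigned)
print(sorted(triples))
```

Because the result is just a set of triples, comparing two translations reduces to comparing two sets.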
Their metric is then simply a comparison of these labeled dependencies, using precision and recall to compute the f-score (harmonic mean).
One of the coolest things in the paper is how they handle parser noise. Statistical parsers are not perfect. They estimate probabilities for rules from labeled data. In natural language, variation is pretty much unlimited, so no matter how big the training corpus, there will always be things the parser has never seen before. Also, we are dealing with imperfect input (produced by MT systems or humans), so the problem of noise could be even greater. To measure it, they run 100 sentences through the various MT metrics they are comparing (including their own), using each sentence as both the reference and the machine translation. This produces the “perfect score” for each metric, since the two are identical. Next, adjuncts are rearranged in each sentence so that the meaning does not change but the structure does. Each MT metric now scores the new sentence against the original. For the LFG parse, the f-structure should remain the same in both cases, so any divergence can be attributed to parser noise. In order to reduce this noise, they used the n-best parses and were able to increase the f-score, bringing it closer to the baseline (ideal). So instead of just comparing the best parse for the reference and the machine translation, they combine the n-best parses to compute the f-score.
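The scoring step and the n-best idea can be sketched as follows. Precision, recall and F1 over sets of triples are standard; pooling the n-best parses by taking the union of their triples is only one plausible reading of "combine", and the paper's exact scheme may differ. All triples and function names here are made up for illustration.

```python
def prf(hyp_triples, ref_triples):
    """Precision, recall and F1 over two sets of dependency triples."""
    matched = len(hyp_triples & ref_triples)
    p = matched / len(hyp_triples) if hyp_triples else 0.0
    r = matched / len(ref_triples) if ref_triples else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def pool_nbest(parses):
    """ASSUMPTION: combine n-best parses by taking the union of their triples."""
    pooled = set()
    for triples in parses:
        pooled |= triples
    return pooled

# Reference parse of "John resigned yesterday"
ref_best = {("subj", "resign", "john"), ("tense", "resign", "past"),
            ("adj", "resign", "yesterday")}
# 1-best parse of "Yesterday, John resigned": the parser mislabels the adjunct
hyp_best = {("subj", "resign", "john"), ("tense", "resign", "past"),
            ("obj", "resign", "yesterday")}
# The second-best parse gets the adjunct right
hyp_second = {("subj", "resign", "john"), ("tense", "resign", "past"),
              ("adj", "resign", "yesterday")}

_, _, f_1best = prf(hyp_best, ref_best)
_, _, f_nbest = prf(pool_nbest([hyp_best, hyp_second]), pool_nbest([ref_best]))
print(f_1best, f_nbest)
```

In this toy case the pooled score is higher than the 1-best score because the correct analysis of the adjunct shows up in the second parse, which is the effect the n-best trick is after.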
The result is that they get correlations with human judgments competitive with the best system they compare themselves to (METEOR, Banerjee and Lavie, 2005), beating it for fluency and coming in a close second overall. As far as future work goes, there are quite a few extensions they mention in the paper. The LFG parser produces 32 different types of grammatical relations. In the current setup, they are all weighted the same, but they would like to try tuning the weights to see how that affects the score. Another extension they propose is using paraphrases derived from a parallel corpus. There has been other work done on paraphrasing for MT evaluation (notably Russo-Lassner et al., 2005). One thing I am curious about is whether changing the weighting in the harmonic mean would have an impact on correlation. METEOR uses a harmonic mean that weights recall nine times as heavily as precision, while the typical thing to do is F1. It’s not clear that weighting precision and recall equally is the best thing to do.
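To show what changing the weight means in practice, here is a small sketch of a weighted harmonic mean. With a recall weight of 1 it is the usual F1, and with a weight of 9 it matches the Fmean formula from the METEOR paper, 10PR / (R + 9P). The function and parameter names are my own.

```python
def weighted_f(precision, recall, recall_weight=1.0):
    """Weighted harmonic mean of precision and recall.

    recall_weight=1 gives F1 = 2PR / (P + R);
    recall_weight=9 gives 10PR / (R + 9P), i.e. recall counts nine times as much.
    """
    if precision == 0 or recall == 0:
        return 0.0
    w = recall_weight
    return (1 + w) * precision * recall / (recall + w * precision)

p, r = 0.5, 0.8
print(weighted_f(p, r))        # balanced F1
print(weighted_f(p, r, 9.0))   # recall-heavy variant
```

When recall is higher than precision, as in this example, the recall-heavy variant gives the higher score, so the choice of weight can reorder systems that trade precision for recall.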
Interesting stuff, though. I hope they continue the work and maybe we’ll see something in this year’s ACL.
Karolina Owczarzak has confirmed they were using the F1 score and that different F-scores did not lead to significant improvements. I also added the image I forgot to include in the original post.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the Association for Computational Linguistics Conference 2005, pages 65-73, Ann Arbor, Michigan.
Aoife Cahill, Michael Burke, Ruth O’Donovan, Josef van Genabith, and Andy Way. 2004. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26, pages 320-327, Barcelona, Spain.
Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization at the Association for Computational Linguistics Conference 2005, Ann Arbor, Michigan.
Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Labelled Dependencies in Machine Translation Evaluation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 104-111, Prague, June 2007.
Grazia Russo-Lassner, Jimmy Lin and Philip Resnik. 2005. A Paraphrase-based Approach to Machine Translation Evaluation. Technical Report LAMP-TR-125/CS-TR-4754/UMIACS-TR-2005-57, University of Maryland, College Park, Maryland.