Stepping back in time in MT Eval from my last post, Liu and Gildea (2005) were among the first to really bring syntactic information to evaluating machine translation output. They proposed three metrics for evaluating machine hypotheses: the subtree metric (STM), the tree kernel metric (TKM), and the headword chain metric (HWCM). STM and TKM also had variants for dependency trees, which HWCM relies on. Owczarzak et al. (2007) extended HWCM from dependency parses to LFG parses. HWCM has attracted more attention since it showed better correlation at the sentence level than either STM and TKM (both versions) and outperformed BLEU on longer n-grams. It’s interesting to note, though, that the dependency-based tree kernel metric performed best of all at the corpus level. Sentence level granularity is typically more important for helping you tune your MT system.
The subtree metric is a fairly straightforward idea. You begin by parsing both the hypothesis and the reference sentences using a parser like Charniak or Collins to get a Penn TreeBank style phrase structure tree. You then compare all subtrees in the hypothesis to the reference trees, thresholding the number of matches by the best match in the reference trees. The formula is given below:
The tree kernel metric uses convolution kernels discussed by Collins and Duffy (2001). For the specifics of this method, I refer you to the respective papers (and I may post on it at a later date), but the general idea is that you can transform structured data (a tree) into a feature vector by using the kernel trick. Finding all subtrees of a tree can be exponential in the size of the sentence, which would make computation infeasible for large sentences. The kernel trick lets us operate in this exponentially-high-dimensional space with a polynomial time algorithm. Once we have constructed the feature vectors for the hypothesis and refernece trees, we can score them with their cosine similarity:
H(T1) and H(T2) are vectors with non-zero values for subtrees (dimensions) that appear in each tree, so the dot product of the two is the number of subtrees in common. The score is computed as the maximum cosine similarity between the hypothesis and the references.
Finally, the headword chain metric (HWCM) relies on dependency parses, which I touched on in my previous post.
In dependency grammars, a tree is built by linking a word to its head. So a determiner would be linked to the noun it modifies, the direct object would be linked to the verb, etc. Each link of this sort is a headword chain of length 2. As you build up the tree, you can construct longer and longer headword chains.
The HWCM score is calculated just like the STM except by comparing headword chains. The difference between the HWCM and the dependency version of the STM is that STM considers all subtrees whereas HWCM only looks at direct mother-daughter relations (no cousins or sisters).
Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Advances in Neural Information Processing Systems.
Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization at the Association for Computational Linguistics Conference 2005, Ann Arbor, Michigan.
Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Labelled Dependencies in Machine Translation Evaluation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 104-111, Prague, June 2007.