For the past few years, the standard way of doing human evaluations of machine translation (MT) quality has been to have human judges grade each sentence of MT output against a reference translation on two measures: adequacy and fluency. Adequacy is the degree to which the translation conveys the information contained in the original (source-language) sentence. Fluency is the degree to which the translation conforms to the standards of the target language (in most cases, English). The judges score each sentence on both measures on a scale of 1 to 5, similar to a movie rating. It became apparent early on that even humans don't correlate well with each other. One judge may be sparing with the 5s he gives out, while another may hand them out freely. The same problem crops up in recommender systems, which I have talked about in the past.
How well judges can score MT output matters because their ratings are the standard against which automatic metrics for MT evaluation are judged: the better a metric correlates with human judgments, the better the metric. This not only helps properly gauge the quality of one MT system against another, it drives improvements in MT systems. If judges don't correlate well with each other, how can we expect automatic methods to correlate well with them? The standard practice now is to normalize the judges' scores in order to remove some of the bias in how each judge uses the rating scale.
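One simple way to do that normalization (a sketch of the general idea, not necessarily the exact scheme used in any particular campaign) is to z-score each judge's ratings, so that a stingy judge's 4 and a generous judge's 5 land in roughly the same place:

```python
from statistics import mean, stdev

def normalize_scores(scores_by_judge):
    """Z-score normalize each judge's ratings: subtract the judge's own
    mean and divide by the judge's own standard deviation."""
    normalized = {}
    for judge, scores in scores_by_judge.items():
        mu = mean(scores)
        sigma = stdev(scores) if len(scores) > 1 else 1.0
        sigma = sigma or 1.0  # guard against a judge who gives one constant score
        normalized[judge] = [(s - mu) / sigma for s in scores]
    return normalized

# A stingy judge and a generous judge rating the same three sentences:
raw = {"judge_a": [2, 3, 4], "judge_b": [4, 5, 5]}
print(normalize_scores(raw))
```

After normalization, both judges' scores are on a common scale, so a "3 from judge A" and a "5 from judge B" can be compared meaningfully.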
Vilar et al. (2007) propose a new way of handling human assessments of MT quality: binary system comparisons. Instead of giving a rating on a scale of 1 to 5, judges compare the output from two MT systems and simply state which is better. The definition of what constitutes "better" is left vague, but judges are instructed not to look specifically for adequacy or fluency. By mixing up the sentences so that a judge is not repeatedly judging the output of the same system (which could introduce additional bias), this method should simplify the task of evaluating MT quality while leading to better intercoder agreement.
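A sketch of how such a comparison campaign might be set up (the `make_comparison_tasks` helper and its field names are my own invention, not from the paper): each sentence gets a random pair of systems, shown to a random judge in a random left/right order, so no judge can learn which side belongs to which system:

```python
import random

def make_comparison_tasks(sent_ids, systems, judges, seed=0):
    """For each sentence, pick two distinct systems and a judge at random.
    The judge sees only 'left' vs 'right' output, not system identities."""
    rng = random.Random(seed)
    tasks = []
    for sid in sent_ids:
        left, right = rng.sample(systems, 2)  # two distinct systems, random order
        tasks.append({"sentence": sid, "left": left, "right": right,
                      "judge": rng.choice(judges)})
    return tasks

tasks = make_comparison_tasks(range(5), ["sysA", "sysB", "sysC"], ["j1", "j2"])
print(tasks[0])
```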
The results were favorable, and the advantages of this method seem to outweigh the fact that it requires more comparisons than the previous method required ratings. The previous method needed two ratings per sentence (one each for adequacy and fluency), or O(n) ratings total, where n is the number of systems (the number of sentences is constant). Binary system comparison requires more judgments because the systems have to be put in order: O(log n!), which works out to O(n log n). In most MT evaluation campaigns the difference is negligible, but it becomes increasingly pronounced as n grows.
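That O(log n!) figure is the information-theoretic number of pairwise comparisons needed to sort n items, and any good comparison sort comes within a constant factor of it. A toy sketch (hypothetical `rank_systems` helper, with judges simulated by a hidden quality score) that ranks systems while counting the judgments consumed:

```python
import functools
import math

def rank_systems(systems, better):
    """Order systems best-first using only pairwise 'which is better?'
    judgments, and count how many judgments the comparison sort used."""
    calls = 0
    def cmp(a, b):
        nonlocal calls
        calls += 1
        return -1 if better(a, b) else 1
    ranked = sorted(systems, key=functools.cmp_to_key(cmp))
    return ranked, calls

# Simulate judges whose preferences follow a hidden quality score:
quality = {"sysA": 0.9, "sysB": 0.5, "sysC": 0.7, "sysD": 0.2}
ranked, n_comparisons = rank_systems(list(quality),
                                     lambda a, b: quality[a] > quality[b])
print(ranked)  # best to worst by hidden quality
print(n_comparisons, "comparisons; log2(4!) =", math.log2(math.factorial(4)))
```

With 4 systems, log2(4!) is about 4.6, so roughly five judgments per sentence instead of the two ratings the old method needed.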
What would be interesting to me is a movie recommendation system that asks you a similar question: which do you like better? Of course, this means more work for you, and the standard approaches for collaborative filtering would have to change. For example, doing singular value decomposition on a matrix of ratings is no longer possible when all you have are comparisons between movies. Also, people will still disagree with themselves: preferences are not always transitive. You might say National Treasure was better than Star Trek VI, which was better than Indiana Jones and the Last Crusade, which was better than National Treasure. You'd have to find some way to deal with cycles like this (ignoring them is one way).
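For illustration, here's a minimal sketch (the `has_cycle` helper is hypothetical) of spotting such a preference cycle with a depth-first search over the "better than" graph:

```python
def has_cycle(prefs):
    """Detect a cycle in a directed preference graph, where an edge
    a -> b means 'a was rated better than b'. DFS with three colors."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def visit(node):
        color[node] = GRAY
        for nxt in prefs.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY:
                return True  # back edge: we looped onto the current path
            if c == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False
    return any(visit(n) for n in list(prefs) if color.get(n, WHITE) == WHITE)

# The National Treasure > Star Trek VI > Last Crusade > National Treasure loop:
prefs = {"National Treasure": ["Star Trek VI"],
         "Star Trek VI": ["Last Crusade"],
         "Last Crusade": ["National Treasure"]}
print(has_cycle(prefs))  # True
```

Once a cycle is detected you still have to decide what to do with it; dropping one of its edges (or, as above, ignoring it) is the simplest option.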
Vilar, D., G. Leusch, H. Ney, and R. E. Banchs. 2007. Human Evaluation of Machine Translation Through Binary System Comparisons. In Proceedings of the Second Workshop on Statistical Machine Translation, 96-103.