Greg Linden and Daniel Lemire have both written a little about the Netflix Prize and whether the systems that are doing the best are really worth anything. The KorBell system that recently won the Progress Prize consists of 107 different parts in an ensemble system (Note: the team of Bob Bell and Yehuda Koren at AT&T goes by BellKor and KorBell on the Netflix leaderboard). The paper is interesting for two reasons: the ensemble method being used and the fact that only about 3 or 4 of those components are doing the heavy lifting. Actually, I have no idea whether the actual ensemble algorithm they use would be especially interesting to anyone else, but as I have no experience with ensembles in this context, it was interesting to me.
Daniel Lemire made the point that RMSE (root mean squared error), the metric used for the Netflix prize, isn’t particularly useful. RMSE is a measure of the error that is computed as follows:
,
where x1 and x2 are the actual and predicted movie ratings. RMSE penalizes larger errors more. Competitors predict ratings for a set of about 2 million movies that Netflix knows the rating for but has kept secret. The score that is reported on the leaderboard is for one part of the set of 2 million movies and there is another set whose score is kept secret and is used as a validation set to settle tie-breakers or choose between in the case of multiple candidates for victory.
So RMSE and other similar metrics (like mean absolute error) are measuring how closely the algorithm can guess the rating some user has already given the movie. It does not evaluate things like how useful offering that recommendation to the user would be or the novelty of the prediction (will the recommendation surprise and delight the user?). Also, it doesn’t explicitly evaluate how quickly these algorithms learn ratings (how many ratings does the system need to see to make a good prediction?).
Then there is the question of whether predicting ratings beyond a certain level of accuracy is at all possible. I think that it’s not. It may be possible to get the predictions on the Netflix set down to the winning number, but I think there is a point for general applications that just can’t be broken and that we are already pretty much there. This idea is not new (Herlocker et al., 2004).
To better relate this idea, let’s consider the data collection method for recommender systems, with Netflix in particular. The user begins by having a list of movies presented to her. She rates the movies she has seen and skips the others. The system begins learning a profile and presents new movies to be rated. These tend to be popular movies at first. Once a profile has been constructed, less popular movies that match the profile are presented and eventually the user has rated a lot of movies they have seen. What about movies the user has seen but the system never presented? The user may search for those movies in order to rate them, but I suspect most people aren’t that dedicated. Even with dedication, remembering every movie you have seen is hard (at least it was for me). This results in a slight skew in the data where the average rating is closer to 4 (on a 5-rating scale) than the average rating for the rating scale (3 in this case). So let’s say our hypothetical user spent an hour or two rating 500 movies.
What would happen if she came back a month later and rated 100 more? Would her ratings behave the same way? Maybe she was feeling optimistic when she first rated the movies and light-hearted movies were remembered more fondly. Now she has dumped her scumbag of a boyfriend and she’s hating the world. Light-hearted movies are crap! These 100 new movies are going to have some variance compared to the 500 movies rated before. The fact is that people generally don’t consistently rate items. This plays hell with recommender systems since they try to find consistencies and predict ratings based on them.
The point of this discussion has been to illustrate the point that natural user variance may prevent systems from improving beyond a certain point using metrics like RMSE and mean absolute error. Going back to the Netflix issue, this means that improving rating prediction accuracy by 10% is trying to break this recommender system equivalent to the sound barrier. If it’s possible, we’re probably just overfitting to the data.
Daniel Lemire said this game got old around 2000, but I’m not sure I agree with that exactly. I think there is some value in finding this barrier. Knowing just how far we can get with accuracy using collaborative and content information is useful in telling us something about the value of ratings. However, we pretty much already know this, so going further will only tell us the degree of the problem, not its existence. So good science? My conclusion: no. Fun science: maybe. Good engineering: yes.
References
J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22, 1 (Jan. 2004), 5-53. [pdf]






3 comments
Comments feed for this article
15 December 2007 at 11:50:55
Chris
Bob Carpenter at Alia-i and Hal Daume at his NLPer blog also blogged about this. I’ll leave the quality assessment up to you guys since you’re far better qualified, but I’ll say this: rarely does anything involving NLP get this kind of press. There’s no such thing as bad press, right?
17 December 2007 at 13:20:06
Ian’s Blog » Netflix Prize - is RMSE a good measurement?
[...] I think a user might have a hard time noticing such a difference in performance. I’m not the first to express concerns of this [...]
12 May 2008 at 23:53:56
MT Eval with Binary Comparisons « The Mendicant Bug
[...] 12 May 2008 in collaborative filtering, computational linguistics, machine learning, machine translation, machine translation evaluation, mt, mt eval, rankings, recommender systems The standard way of doing human evaluations of machine translation (MT) quality for the past few years has been to have human judges grade each sentence of MT output against a reference translation on measures of adequacy and fluency. Adequacy is the level at which the translation conveys the information contained in the original (source language) sentence. Fluency is the level at which the translation conforms to the standards of the target language (in most cases, English). The judges give each sentence a score for both in the range of 1-5, similar to a movie rating. It became apparent early on that not even humans correlate well with each other. One judge may be sparing with the number of 5’s he gives out, while another may give them freely. The same problem crops up in recommender systems, which I have talked about in the past. [...]