I’ve been messing around with recommender systems for the past year and a half, but not using the kNN (k-Nearest Neighbors) algorithm. However, my current homework assignment for my Information Retrieval class is to implement kNN for a subset of the Netflix Prize data. The data we are working with is about 800k ratings, which is slightly smaller than the MovieLens dataset, which was the previous dataset of choice for research on movie recommender systems. The entire Netflix data set dwarfs the MovieLens set by a factor of 100, so it is quickly replacing MovieLens in papers. The Netflix data is much sparser than the MovieLens, which changes things, as well.

kNN is a fairly simple machine learning algorithm to implement. On a dataset the size of Netflix, it’s still easy to do stupid things that cause it to take forever. Recommender systems typically match users to movies on a scale, which is the user’s rating for that item. In the case of Netflix, the scale is 1 (hate) to 5 (love). For the Netflix Prize, the goal is to guess user’s ratings on a hidden set as correctly as possible (according to root mean squared error (RMSE)). One way of approaching the problem is to create a user-items matrix where the rows correspond to a user, the columns to an item (movie) and the value in each cell is the user’s rating for that item. If the user has not rated the item, it is assigned a zero. Now, we can split this matrix up into vectors, where each row vector represents a user. kNN seeks to find similar users to other users (or similar movies to other movies) according to some metric over these vectors. I won’t bother going into the metrics in detail, but they include cosine similarity, Euclidean distance, and Pearson correlation. The ratings from the chosen *k* users are combined (either by simply averaging or using some weighted average) to form the prediction for a movie the user has not yet rated.

So on the test dataset for this assignment, I built three models that had the following RMSE scores:

Model 1 | 0.9831 |

Model 2 | 1.0371 |

Model 3 | 0.9768 |

Just guessing the average rating for each movie gives an RMSE of 1.0 in this case, so Models 1 and 3 improve over the baseline, while Model 2 does worse. The best performing teams in the Netflix prize use ensemble methods to combine various models. The simple way to do this is just with a linear combination. So given models {m1, m2, m3} and weights {w1, w2, w3}, the ensemble prediction would be w1m1 + w2m2 + w3m3 (where w1+w2+w3=1.0). This was the first time I had tried ensembles with recommender systems as well, so imagine my surprise when I hit 0.9469 RMSE with my best choice of w’s. Of course, this is nowhere near the number needed to actually claim the prize, but it was a nice demonstration of the power of ensemble methods. I recommend checking out the proceedings of last year’s KDD Cup if you’re interested.

Can you please elaborate on the math behind the blending?

Let’s say for movie

m, there are the following predicted ratings based on the above mentioned models:3.47, 4.09, 2.75

With the weighting scheme I mentioned, where the weights are {0.45, 0.35, 0.2}, the blended rating would be 3.47 * 0.45 + 4.09 * 0.35 + 2.75 * 0.2 = 3.54. The weights have to sum to one, or else you would need to normalize the final score in some way.

This is just one way of doing it, but I think it’s fairly typical since it works well and is really easy to do a grid search over the weights on development data to optimize your parameters. It does however make your final model less explainable. Compare this odd blending of models to simple kNN where you can say, well I predicted this rating because k closest users said it should be this. In the real-life case above, a model that performs worse is given a higher weight than the best model. Why?

Jason — I wrote that homework assignment & its fun to see someone getting excited about it. Its surprisingly hard to beat the baseline of predicting the average moving rating without looking at the user information at all.

There’s a lot you can do in scaling & normalizing the ratings — some users are more generous on average than others, for example, and taking that into account when you do predictions helps quite a bit. A 5 from someone who gives all 4’s and 5’s means something different than a 5 from a different user.

One thing I didn’t look into at all, but is likely a promising direction, is taking into account time. I would guess ratings come in bursts & someone’s mood at the moment heavily influences their rating generosity. I recently read a blog post on someone who’s doing exactly this with the netflix dataset with much success. (the link escapes me now…)

Good luck with your assignment.

Jason – what confuses me is how you get those weights. Do you have code you can share?

anon:

I manually found the weights, so it was probably a local optimum. First, if you are trying to submit scores to the competition, you are limited in the frequency you can submit, so doing the same thing I did is infeasible. Also, if you have a large number of models, exhaustive grid search becomes expensive. I don’t have code for this part, but it’s not especially difficult.

What you would do if you were submitting to Netflix is this. Hold out some test cases, like the probe set they include. For each of your models, generate a prediction for test cases in the probe set. Let’s call these {M1, M2, …, Mn} for n models. Each model consists of the predictions for the test cases in the probe set. In a grid search, you would cycle through all possible values for the weights and find the best RMSE for the ensemble. So if you have a weight set W = {w1, w2, …, wn}, you would basically have n loops over the possible values of w (which you can probably restrict to the interval [-1,1], with the constraint that they all sum 1). So if you were combining two models:

for i = -1 to 1 step 0.1

for j = 1 to size(probe_set)

p[j] = i * M1[j] + (1 – i) * M2[j]

As you can see, each model you add increases the complexity by a factor O(ms) (where m is the number size of the probe set and s is the number of parameter values you are checking), so if you have 100 models, it will take forever this way: O(m*s^99). But you can apply some heuristics and limit the space you search.

One final note: You would only need to do this step once to find the weights to apply to each model. So as long as your models remain the same you don’t have to repeat this step. If you add or change a model, you have to reconstruct this.

Jon:

It was a great homework assignment — one of the more interesting I’ve had in a long while. Kudos!

I tried some normalization, but there were definitely more things to try (and a lack of time to do it). I completely agree about the problems of raters using different mental scales and selectively rating certain movies. I have also heard of people taking time into account, but I can’t remember any details or anything that stood out to me as particularly cool. Time-based normalizations are not a bad idea at all, I wonder how much they would help..

WELL THIS ARTICLE IS REALLY VERY USEFUL FOR ME… I AM A STUDENT OF M.TECH AT GJUS&T WORKING ON RECOMMENDER SYSTEMS… THANKS TO AUTHOR….. SIKANDER