You are currently browsing the category archive for the 'machine learning' category.

This is research I did a while ago and presented Monday to fulfill the requirements of my Masters degree.  The presentation only needed to be about 20 minutes, so it was a very short intro.  We have moved on since then, so when I say future work, I really mean future work.  The post is rather lengthy, so I have moved the main content below the jump.

Read the rest of this entry »

Peter Turney posted recently on the logic of attributional and relational similarity. Attributes are features or characteristics of a single entity. Relations describe some connection between two entities, such as a comparison. We’ll denote a relation between two entities A and B as A:B. A relational similarity between two groups A, B and C,D will be denoted as A:B::C:D. This is the standard SAT-style proportional analogy: A is to B as C is to D. An attributional similarity indicates that two entities share the same attribute (this could be to varying degrees, but in the boolean case, it’s either shared or it isn’t). An attributional similarity between A and B will be denoted as A~B. This is like saying \forall Z, A:Z::B:Z. I’m just giving a brief introduction here, but this is all in Peter’s post to greater detail, so I recommend reading that for more information.

This got me thinking about collaborative filtering (because, well, I’ve been thinking about it all the time for the past two years). Collaborative filtering exploits similarities between users to predict preferences for items the user has not seen. In the case of movie recommendations, like with Netflix, this means that users can recommend movies they have seen to similar users who have not seen those movies. There are many ways of doing this. At the heart of it, however, is this notion of relational and attributional similarity.

A: base user
B: movies rated by A
C: some set of other users
D: movies rated by C

We can’t just say that A:B::C:D, since A and C may be nothing like each other. If we constrain it to users with attributional similarity, then we arrive at the definition of collaborative filtering: A~C & A:B::C:D. Logically, it follows that B~D also holds. See Peter’s post for some properties of proportional analogies that make this more clear.

In the non-binary case, we can choose C to be a set of users whose similarity varies with A. Also, our measure of what exactly constitutes similarity can be any number of different metrics. From here, it seems pretty clear that the limit of collaborative filtering is bounded by the attributional similarity A~C. If (A~C) & (A = C) (complete similarity) then it follows that B = D or else A \neq C. If A \neq C then does it logically follow that B \neq D? I guess it depends on the similarity metric and how we are defining the differences in the sets of movies and the differences in the sets of users.

I wonder if there has been any work done in this area? I wasn’t able to find anything, but maybe I’m just not searching for the right thing. Is it even worth pursuing?

The standard way of doing human evaluations of machine translation (MT) quality for the past few years has been to have human judges grade each sentence of MT output against a reference translation on measures of adequacy and fluency.  Adequacy is the level at which the translation conveys the information contained in the original (source language) sentence.  Fluency is the level at which the translation conforms to the standards of the target language (in most cases, English).  The judges give each sentence a score for both in the range of 1-5, similar to a movie rating.   It became apparent early on that not even humans correlate well with each other.  One judge may be sparing with the number of 5’s he gives out, while another may give them freely.  The same problem crops up in recommender systems, which I have talked about in the past.

It matters how well judges can score MT output, because that is the evaluation standard by which automatic metrics for MT evaluation are judged.  The better an MT metric correlates with how human judges would rate sentences, the better.  This not only helps properly gauge the quality of one MT system over another, it drives improvements in MT systems.  If judges don’t correlate well with each other, how can we expect automatic methods to correlate well with them?  The standard practice now is to normalize the judges’ scores in order to help remove some of the bias in the way each judge uses the rating scale.

Vilar et al. (2007) propose a new way of handling human assessments of MT quality:  binary system comparisons.  Instead of giving a rating on a scale of 1-5, they propose that judges compare the output from two MT systems and simply state which is better.  The definition of what constitutes “better” is left vague, but judges are instructed not to specifically look for adequacy or fluency.  By mixing up the sentences so that one judge is not judging the output of the same system (which could introduce additional bias), this method should simplify the task of evaluating MT quality while leading to better intercoder agreement.

The results were favorable and the advantages of this method seem to outweigh the fact that it requires more comparisons than the previous method required ratings.  The total number of ratings for the previous method was two per sentence:  O(n), where n is the number of systems (the number of sentences is constant).  Binary system comparisons requires more ratings because the systems have to be ordered:  O(log n!).  In most MT comparison campaigns the difference is negligible, but it becomes increasingly pronounced as n increases.

What would be interesting to me is a movie recommendation system that asks you a similar question:  which do you like better?  Of course, this means more work for you.  The standard approaches for collaborative filtering would have to change.  For example, doing singular value decomposition on a matrix of ratings would no longer be possible when all you have are comparisons between movies.  Also, people will still disagree with themselves (in theory).  You might say National Treasure was better than Star Trek VI, which was better than Indiana Jones and the Last Crusade, which was better than National Treasure.  You’d have to find some way to deal with cycles like this (ignoring it is one way).

References

Vilar, D., G. Leusch, H. Ney, and R. E. Banchs. 2007. Human Evaluation of Machine Translation Through Binary System Comparisons. In Proceedings of the Second Workshop on Statistical Machine Translation. 96-103. [pdf]

This is an absolutely awesome visualization of the Democratic race so far.

A French-built supercomputer beat a 5 dan Go master in France a couple weeks ago.  Go is a game I became very interested in in January 2007.  I played several thousand games between then and a month ago, when I deleted my account on an online turn-based Go server.  My reason for quitting was that it was taking too much time I should be using for studying, and I was letting it frustrate me too much.  Go is a game that requires mental peace.  You know how when you became a Jedi, you had to let go of your anger?  Same helps for Go.  I’ll take it back up again at some point, because it is a great mental exercise, but my obsession was just becoming too great.

The reason I picked up Go in the first place was that it remained outside the reach of computers.  Of course, it was only a matter of time before it too fell.  And actually, it hasn’t yet.  Just because it beat a 5 dan French master, doesn’t mean it can beat a 9 dan master from China.  So we’ll see.

The method this system used to beat said master was a Monte Carlo method.  These are brilliantly simple in theory.  You basically generate a multitude of random games for a set of moves and score each resulting game state.  The next move with the best scoring set of random game states is chosen.  This can also be thought of as voting.  A set of random models each vote for a move.  The most (or strongest) votes win.  And when 10,000 monkeys agree…

I’ve been messing around with recommender systems for the past year and a half, but not using the kNN (k-Nearest Neighbors) algorithm. However, my current homework assignment for my Information Retrieval class is to implement kNN for a subset of the Netflix Prize data. The data we are working with is about 800k ratings, which is slightly smaller than the MovieLens dataset, which was the previous dataset of choice for research on movie recommender systems. The entire Netflix data set dwarfs the MovieLens set by a factor of 100, so it is quickly replacing MovieLens in papers. The Netflix data is much sparser than the MovieLens, which changes things, as well.

kNN is a fairly simple machine learning algorithm to implement. On a dataset the size of Netflix, it’s still easy to do stupid things that cause it to take forever. Recommender systems typically match users to movies on a scale, which is the user’s rating for that item. In the case of Netflix, the scale is 1 (hate) to 5 (love). For the Netflix Prize, the goal is to guess user’s ratings on a hidden set as correctly as possible (according to root mean squared error (RMSE)). One way of approaching the problem is to create a user-items matrix where the rows correspond to a user, the columns to an item (movie) and the value in each cell is the user’s rating for that item. If the user has not rated the item, it is assigned a zero. Now, we can split this matrix up into vectors, where each row vector represents a user. kNN seeks to find similar users to other users (or similar movies to other movies) according to some metric over these vectors. I won’t bother going into the metrics in detail, but they include cosine similarity, Euclidean distance, and Pearson correlation. The ratings from the chosen k users are combined (either by simply averaging or using some weighted average) to form the prediction for a movie the user has not yet rated.

So on the test dataset for this assignment, I built three models that had the following RMSE scores:

Model 1 0.9831
Model 2 1.0371
Model 3 0.9768

Just guessing the average rating for each movie gives an RMSE of 1.0 in this case, so Models 1 and 3 improve over the baseline, while Model 2 does worse. The best performing teams in the Netflix prize use ensemble methods to combine various models. The simple way to do this is just with a linear combination. So given models {m1, m2, m3} and weights {w1, w2, w3}, the ensemble prediction would be w1m1 + w2m2 + w3m3 (where w1+w2+w3=1.0). This was the first time I had tried ensembles with recommender systems as well, so imagine my surprise when I hit 0.9469 RMSE with my best choice of w’s. Of course, this is nowhere near the number needed to actually claim the prize, but it was a nice demonstration of the power of ensemble methods. I recommend checking out the proceedings of last year’s KDD Cup if you’re interested.

This conference looks like it might be fun if you’re a student working on some area of AI/machine learning and going to a university in the Northeastern US. NESCAI is the North East Student Colloquium on Artificial Intelligence and will be held at Cornell May 2-4, 2008. The deadline for papers is March 7, 2008, so the date is fast approaching. The full CFP is below the fold.

Read the rest of this entry »

The article I mentioned the other day concerning a computer program that confirms dogs communicate has drawn attention from Language Log [first here, more here]. The first was more of a rant from Geoff Pullum that left me feeling like he’s just not much of a dog person (or at the very least, has a healthy skepticism of animal communication claims).  Actually I think he is more angry with the way the media covers this sort of research, but I should stop now before putting too many uninformed words in his mouth.  Mark Liberman goes much more in depth and actually picks apart the paper by Molnar, Kaplan, Roy, Pachet, Pongracz, Doka and Miklosi (the Hungarian scientists mentioned in my previous post).

For anyone interested in machine learning and/or animal communication, I think the Liberman post is worth reading. A few highlights are as follows:

  • no tests were done to see if the computer was significantly more accurate than humans
  • computer accuracy overall was 43% while human accuracy was 40%
  • the article is less about communication than it is about the physiological state used to produce the barks: that is, if a dog is emotionally stimulated, body in a lunging position, his bark will naturally differ from a resting dog

The first two points are important in that the pop science articles reporting the study misrepresented the impact of the research — not very surprising. The third point is more interesting to me, though I have never done anything with animal communication aside from learn about it briefly in a introductory linguistics class.  I had heard about gorillas who could communicate with sign language, and assumed the results were provocative but not controversial. It was fascinating to learn that whether gorillas are doing anything more than memorizing a set of signs that lead to rewards is still debated.

I saw a video via StumbleUpon the other day where a chimp and a human are shown a screen with numbers that are flashed quickly then converted to blank squares. The task is to touch the squares in descending order. The chimps can do it amazingly fast and humans screw it up big time. I attribute this to the idea that animals are “present” or “in the moment,” while humans tend to have a lot going on in their heads that distracts them from the real world. The chimp reacts to the present world, and the humans get bungled up by trying to sort the spatial configuration of the screen as they see it.  They are slowed down by converting the scene into a mental representation, rather than just seeing what is in front of them.  But I’m just theorizing…  I hope someone with more experience with cognitive science can enlighten this off-the-cuff opinion.

Returning to Mark Liberman’s comments about the physiological condition of the dog, I have to partially disagree.  First of all, I definitely agree that the dog’s physical position allows the proper bark to be made.  For Daedalus to produce his beagle howl, his body must be rigid and his head extended upwards.  I have tried to move his body to prevent this bark (because it’s loud as hell on a quiet street at 12am) and have managed to distort it.  However, it is still clearly recognizable as this particular type of bark.  I don’t really see, though, why the territorial instinct of warning other animals away from his territory (wolf heritage) would demand the body rigid, head-back position.  I think the bark demands that position in order to be made, much as our mouth must be in different configurations to make different sounds (ignoring trained exceptions like ventriloquism).  That’s not to say the “fight” body position and the “fight” bark are not interrelated.  After all, a human must be holding a sword and facing massive opposition to yell:

This is SPARTAAAAAAAAAAAAAAAAAAAAAAAAA!!!!!!!

Greg Linden and Daniel Lemire have both written a little about the Netflix Prize and whether the systems that are doing the best are really worth anything. The KorBell system that recently won the Progress Prize consists of 107 different parts in an ensemble system (Note: the team of Bob Bell and Yehuda Koren at AT&T goes by BellKor and KorBell on the Netflix leaderboard). The paper is interesting for two reasons: the ensemble method being used and the fact that only about 3 or 4 of those components are doing the heavy lifting. Actually, I have no idea whether the actual ensemble algorithm they use would be especially interesting to anyone else, but as I have no experience with ensembles in this context, it was interesting to me.

Read the rest of this entry »

Since I work with recommender systems, I’d hardly be doing my job if I didn’t notice things like Google Reader’s new feed recommendations. From the description of how the recommender works on the Google help page (which is unfortunately not very specific):

Your recommendations list is automatically generated. It takes into account the feeds you’re already subscribed to, as well as information from your Web History, including your location. Aggregated across many users, this information can indicate which feeds are popular among people with similar interests. For instance, if a lot of people subscribe to feeds about both peanut butter and jelly, and you only subscribe to feeds about peanut butter, Reader will recommend that you try some jelly.

This sounds like they are using a hybrid recommender system. When you are recommending items (in this case feeds) to users, you can either consider the qualities of the items themselves (content-based) or the behavior of people similar to you (collaborative filtering). The Netflix Prize is a collaborative filtering case for the most part, though it is possible to add in some amount of content.

Read the rest of this entry »

Now here’s a great idea.  StupidFilter is an open-source project with the goal of rooting out and destroying stupid comments in blogs, wikis, YouTube, flickr, and just about any place morons are allowed to voice their opinions.  Pulling this off would allow me to read the comments on Flickr without wanting to rip my eyeballs out.  No longer will I gag when I accidentally allow myself to glance at the comments on a YouTube video.  Yes, the world will be a better place.

The best part will be the complaints by users who are no longer able to leave comments.

“Yo I tired like 333 tiems to comemt and can’t!!!111  watfxup?!?!”

Hat tip.

About Me

Jason M. Adams

My name is Jason M. Adams and I recently graduated with my masters from the Language Technologies Institute at Carnegie Mellon University. My main areas of research were with recommender systems and word sense disambiguation. Now I am on the job market. And I am obsessed with my two dogs.

Calendar

July 2008
S M T W T F S
« Jun    
 12345
6789101112
13141516171819
20212223242526
2728293031  

Archives

Site Statistics

  • 68,437 reads

Site Information

Contact me: jaso...@gmail.com

Creative Commons License

This work by Jason M. Adams is licensed under a Creative Commons Attribution 3.0 License.

Header image credit seakwenby.

Random Crap