I attended a Matlab training seminar yesterday with the dual topics of “Advanced Matlab Programming” and “Distributed and Parallel Computing.” Of the two, the Advanced section was more interesting, though my original motivation for going was the parallel computing part. In the morning, I felt like it was going to be a waste because my Matlab programming skills are weak, and if my advisor had not strongly suggested I attend, I might’ve skipped it. I’m glad he did, because it was surprisingly enjoyable and I felt like it was right on my level. This might be because programming in Matlab isn’t especially hard or different from other programming languages and I know enough to get by already. Or it might be because Matlab is becoming a little more like Python.
Today is the official opening day of GWAP: Games with a Purpose. This is one of two research projects I have been working on for the past few months, though my involvement with GWAP so far has only been in the form of attending meetings, minor testing, and offering my sage gaming advice (and by sage, I mean the herb). GWAP is the next phase in Luis von Ahn’s human computation project. If you visit and play some games, not only will you be rewarded with a good time, but you’ll be helping science! Science needs you. To play games. Now.
The Idea
Artificial intelligence has come a long way, but humans are still far better than computers at simple, everyday tasks. We can quickly pick out the key points in a photo, we know what words mean and how they are related, we can identify various elements in a piece of music, etc. All of these things are still very difficult for computers. So why not funnel some of the gazillion hours we waste on solitaire into something useful? Luis has already launched a couple of websites that let people play games while solving these problems. Perhaps you’ve noticed the link to Google Image Labeler on Google Image Search? That idea came from his ESP game (which is now on GWAP).
The Motivation
What researchers need to help them develop better algorithms for computers to do these tasks is data. The more data the better. Statistical machine translation has improved quite a bit over the past few years, in large part due to an increased amount of data. This is the reason why languages that are spoken by few people (even those spoken by as few as several million) still don’t have machine translation tools: there is just not enough data. More data means more food for these algorithms which means better results. And if results don’t improve, then we have learned something else.
The Solution
Multiple billions of hours are spent each year on computer games. If even a small fraction of that time were spent performing some task that computers aren’t yet able to do, we could increase the size of the data sets available to researchers enormously. Luis puts this all a lot better than I can, and fortunately, you can watch him on YouTube (below).
So, check it out already.
The standard way of doing human evaluations of machine translation (MT) quality for the past few years has been to have human judges grade each sentence of MT output against a reference translation on measures of adequacy and fluency. Adequacy is the level at which the translation conveys the information contained in the original (source language) sentence. Fluency is the level at which the translation conforms to the standards of the target language (in most cases, English). The judges give each sentence a score for both in the range of 1-5, similar to a movie rating. It became apparent early on that not even human judges correlate well with each other. One judge may be sparing with the number of 5’s he gives out, while another may give them freely. The same problem crops up in recommender systems, which I have talked about in the past.
It matters how well judges can score MT output, because that is the evaluation standard by which automatic metrics for MT evaluation are judged. The better an MT metric correlates with how human judges would rate sentences, the better. This not only helps properly gauge the quality of one MT system over another, it drives improvements in MT systems. If judges don’t correlate well with each other, how can we expect automatic methods to correlate well with them? The standard practice now is to normalize the judges’ scores in order to help remove some of the bias in the way each judge uses the rating scale.
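To make the normalization step concrete, here is a minimal sketch of one common way to do it: converting each judge's raw scores to z-scores (subtract that judge's mean, divide by that judge's standard deviation). This is an assumption about what "normalize" means here, not necessarily the exact procedure used in any particular MT evaluation campaign, and the judge names and scores are made up for illustration.

```python
from statistics import mean, pstdev

def normalize_by_judge(ratings):
    """Map each judge's raw 1-5 scores to z-scores computed per judge.

    ratings: dict of judge name -> list of raw scores (hypothetical format).
    """
    normalized = {}
    for judge, scores in ratings.items():
        mu = mean(scores)
        sigma = pstdev(scores)
        if sigma == 0:
            # A judge who gives everything the same score carries no signal.
            normalized[judge] = [0.0 for _ in scores]
        else:
            normalized[judge] = [(s - mu) / sigma for s in scores]
    return normalized

# Made-up data: one judge is sparing with high scores, the other is generous,
# but both rank the four sentences the same way.
ratings = {
    "strict_judge": [1, 2, 2, 3],
    "generous_judge": [3, 4, 4, 5],
}
normalized = normalize_by_judge(ratings)
for judge, scores in normalized.items():
    print(judge, [round(s, 2) for s in scores])
```

After normalization the two judges' scores line up, even though their raw scores never overlapped at the extremes. It only corrects for how each judge uses the scale, not for genuine disagreement about which sentence is better.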
Vilar et al. (2007) propose a new way of handling human assessments of MT quality: binary system comparisons. Instead of giving a rating on a scale of 1-5, they propose that judges compare the output from two MT systems and simply state which is better. The definition of what constitutes “better” is left vague, but judges are instructed not to specifically look for adequacy or fluency. By mixing up the sentences so that no single judge keeps judging the output of the same system (which could introduce additional bias), this method should simplify the task of evaluating MT quality while leading to better intercoder agreement.
The results were favorable, and the advantages of this method seem to outweigh the fact that it requires more comparisons than the previous method required ratings. The previous method needed two ratings (adequacy and fluency) per sentence for each system, so the total is O(n), where n is the number of systems (the number of sentences is constant). Binary system comparisons require more judgments because the systems have to be ordered, and ordering n systems by pairwise comparison takes O(log n!) comparisons, which is O(n log n). In most MT comparison campaigns the difference is negligible, but it becomes increasingly pronounced as n increases.
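The difference in cost can be sketched in a few lines. The code below counts ratings as two per system output, following the count above, and uses ceil(log2(n!)) as the worst-case number of pairwise judgments a comparison sort needs. The `prefer` function stands in for a human judge, and the system names and hidden quality scores are invented for illustration.

```python
import math
from functools import cmp_to_key

def rating_count(n_systems):
    # Two ratings (adequacy and fluency) per system output for one sentence.
    return 2 * n_systems

def comparison_lower_bound(n_systems):
    # Worst-case comparisons needed by any comparison sort: ceil(log2(n!)).
    return math.ceil(math.log2(math.factorial(n_systems)))

def rank_with_judge(systems, prefer):
    """Order systems best-first using only pairwise judgments, counting them."""
    calls = 0

    def compare(a, b):
        nonlocal calls
        calls += 1
        return -1 if prefer(a, b) else 1

    ordered = sorted(systems, key=cmp_to_key(compare))
    return ordered, calls

# Hypothetical systems with a hidden quality the simulated judge consults.
quality = {"A": 0.3, "B": 0.9, "C": 0.5, "D": 0.7}
ordered, calls = rank_with_judge(list(quality), lambda a, b: quality[a] > quality[b])
print(ordered, "using", calls, "pairwise judgments")

for n in (2, 4, 8, 16, 32):
    print(n, "systems:", rating_count(n), "ratings vs at least",
          comparison_lower_bound(n), "comparisons in the worst case")
```

For a handful of systems the two counts are close, and the comparison count only pulls ahead as the number of systems grows.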
What would be interesting to me is a movie recommendation system that asks you a similar question: which do you like better? Of course, this means more work for you. The standard approaches for collaborative filtering would have to change. For example, doing singular value decomposition on a matrix of ratings would no longer be possible when all you have are comparisons between movies. Also, people will still disagree with themselves (in theory). You might say National Treasure was better than Star Trek VI, which was better than Indiana Jones and the Last Crusade, which was better than National Treasure. You’d have to find some way to deal with cycles like this (ignoring them is one way).
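Spotting such a cycle is at least straightforward. As a rough sketch, treat each "X is better than Y" answer as a directed edge from X to Y and run a depth-first search looking for a path that loops back on itself. The function name and the (better, worse) pair format are assumptions made for this example, not part of any existing recommender.

```python
def find_cycle(preferences):
    """Return one preference cycle as a list of titles, or None if there is none.

    preferences: list of (better, worse) pairs (hypothetical input format).
    """
    graph = {}
    for better, worse in preferences:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # nxt is already on the current path, so we have looped back.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

answers = [
    ("National Treasure", "Star Trek VI"),
    ("Star Trek VI", "Indiana Jones and the Last Crusade"),
    ("Indiana Jones and the Last Crusade", "National Treasure"),
]
cycle = find_cycle(answers)
print(" > ".join(cycle) if cycle else "no cycle")
```

Detecting the cycle is the easy part. Deciding which of the three answers to discard, or whether to keep all of them as noisy evidence, is the real design question.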
References
Vilar, D., G. Leusch, H. Ney, and R. E. Banchs. 2007. Human Evaluation of Machine Translation Through Binary System Comparisons. In Proceedings of the Second Workshop on Statistical Machine Translation. 96-103.
I attended some of the final presentations of an undergrad class on Game Programming today with a friend. We went in expecting something more like a poster session, where people are arrayed around a room showing their work off to a few people who managed to crowd around them. The poster session is ideal for brief browsing, because you can skip anything you’re not interested in. Instead, it was a series of PowerPoint presentations, each followed by an on-screen demo.
Mayhaps you have used the Facebook app Likeness. It’s a fluff app, but has wide appeal since it does two things most people like: easy quizzes and comparisons with our friends. The graphic design that went into the app is a bit low-scale, but it gets the job done. If you haven’t used it, the concept is simple. You are presented with a quiz topic, like “What’s your addiction?” You are then presented with ten items that you must rank in the order specified by the question page (usually most to least favorite, or whatever). Once you have ranked the ten items, you are shown a screen that easily allows you to goof up and spam all your friends. But after that, it produces some sort of similarity score between you and all your friends who have taken it. I’ve never had a similarity below 46% and never one above 98%.
But it got me thinking, how exactly are they measuring this similarity?
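There is no way to tell from the outside what Likeness actually computes, so what follows is only a guess at the kind of measure that would produce a percentage from two rankings. It counts the fraction of item pairs that both people put in the same relative order (the idea behind Kendall's tau). The function name and the quiz items are made up for illustration.

```python
from itertools import combinations

def pairwise_agreement(ranking_a, ranking_b):
    """Fraction of item pairs that both rankings place in the same order."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("both rankings must contain the same items")
    pairs = list(combinations(ranking_a, 2))
    if not pairs:
        return 1.0
    same = sum(
        1 for x, y in pairs
        if (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y])
    )
    return same / len(pairs)

# Hypothetical answers to a "What's your addiction?" style quiz.
mine = ["coffee", "chocolate", "tv", "internet"]
friend = ["chocolate", "coffee", "tv", "internet"]
score = pairwise_agreement(mine, friend)
print(f"{score:.0%} similar")
```

A measure like this gives 100% only for identical rankings and 0% only for exactly reversed ones, which would fit with rarely seeing scores at either extreme.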
I wonder how many blog posts have this title? It’s just the catchy thing we bloggers love. I originally started with “Rowling Howling” and Google was saying twelve results (from blogs) but only three recently. I updated it to yowling when I saw there were no results from Google Blog Search.
Anyhow, Orson Scott Card, author of my beloved Ender’s Game, has a nice diatribe (oxymoron?) against J. K. Rowling’s latest misdeed (and I’m just hearing about this). The word diatribe often has an unsavory connotation against the issuer of said diatribe, but I want to be clear from the start that I think Card is perfectly in the right.
Apparently, some poor schmuck published a book that acts as a reference for the Harry Potter series called Harry Potter Lexicon. Said schmuck, according to Rowling, simply rearranged her work and so it represents a violation of copyright. The terrible yowling that JKR has committed during the course of this utter debacle is truly shameful. I loved the Harry Potter series, but since its completion, she has been going downhill. I’m going to have to agree with Card here: I think she wants to be taken seriously. Why can’t people be content with mad cheddar? Would people be happier if they had the respect of millions of people rather than millions of dollars? People always say money can’t buy happiness, and it would seem to be correct, since she is nickel-and-diming this poor fool who raised his head an inch above the crowd-line. Just for once, I’d like someone to prove it to me (by giving me millions of dollars). The deal is, if I stay happy, I get to keep the money.
And completely unrelated, I decided I liked the word “crowd-line” and checked to see if it’s available. Unfortunately, the .com variation is taken, though .org is free. Estibot guesses the .com is worth $140 (compare that to mendicantbug.com, which is worth a whopping $340). Crowd-line with a dash dot com is available, though. Interestingly, Go Daddy is selling .info domains for $0.99 a year. Is that because they are trash and the refuge of spammers and online biz marketers? The only domain extension more reprehensible is .biz itself, which they are selling for even more. The day I type a .biz address into an address bar is the day I leave the interwebs for good.
I think end-of-semester stress is making me grumpy.
NLP app idea: construct random songs by scraping lyrics websites and stringing together common phrases. It’s a Pandora night for me and here were a couple lyrics that struck me as particularly meaningful. Both by Regina Spektor, introduced to me by Pandora before she became (semi-)famous.
And then you take that love you made
And stick it into some
Someone else’s heart
Pumping someone else’s blood
- “On the Radio”
Beneath the stars came fallin’ on our heads
But they’re just old light, they’re just old light
- “Samson”
I love how she takes the beautiful image of stars falling on their heads and strips it bare of all romanticism and attached meanings, exposing them for what they are: old light.
My new favorite band (thank you, Pandora): the Dresden Dolls. The band is a Boston duo with vocals by Amanda Palmer, who is supposed to be releasing an album this year with some collaboration by Ben Folds. They describe themselves as Brechtian (as in Bertolt) punk cabaret, which actually seems to fit. The lyrics are occasionally self-referential, often bitter and always insightful. The music is a blend of piano, carnival music, and the 1920’s. Plus a million other things. So cool.
Note, the YouTube version of “Coin Operated Boy” is about a minute short. If you can get your hands on the full version, I find it much better. Another song I love below.
We went to a Pittsburgh Pirates game tonight. They got their asses handed to them by the Phillies. It was so sad. Donna’s brother and his wife were visiting and their husky has separation anxiety. In the 7th inning, we got a call from our neighbor that an old woman came around complaining and saying she was going to call the cops. Our dogs are normally pretty quiet, but the husky was in a new situation and away from her owners, so poor thing was going nuts.
Got a pic of the parrot mascot at the game. Proof that I occasionally watch sports.





