You are currently browsing the monthly archive for April, 2008.
NLP app idea: construct random songs by scraping lyrics websites and stringing together common phrases. It’s a Pandora night for me and here were a couple lyrics that struck me as particularly meaningful. Both by Regina Spektor, introduced to me by Pandora before she became (semi-)famous.
And then you take that love you made
And stick it into some
Someone else’s heart
Pumping someone else’s blood
- “On the Radio”
Beneath the stars came fallin’ on our heads
But they’re just old light, they’re just old light
- “Samson”
I love how she takes the beautiful image of stars falling on their heads and strips it bare of all romanticism and attached meanings, exposing them for what they are: old light.
My new favorite band (thank you, Pandora): the Dresden Dolls. The band is a Boston duo with vocals by Amanda Palmer, who is supposed to be releasing an album this year with some collaboration by Ben Folds. They describe themselves as Brechtian (as in Bertolt) punk cabaret, which actually seems to fit. The lyrics are occasionally self-referential, often bitter and always insightful. The music is a blend of piano, carnival music, and the 1920’s. Plus a million other things. So cool.
Note, the YouTube version of “Coin Operated Boy” is about a minute short. If you can get your hands on the full version, I find it much better. Another song I love is below.
We went to a Pittsburgh Pirates game tonight. They got their asses handed to them by the Phillies. It was so sad. Donna’s brother and his wife were visiting and their husky has separation anxiety. In the 7th inning, we got a call from our neighbor that an old woman came around complaining and saying she was going to call the cops. Our dogs are normally pretty quiet, but the husky was in a new situation and away from her owners, so poor thing was going nuts.
Got a pic of the parrot mascot at the game. Proof that I occasionally watch sports.
This is an absolutely awesome visualization of the Democratic race so far.
Created by OnePlusYou
Apparently, I run a rather clean shop. Whodathunkit. And probably most of the cussing comes from my posts on brainfuck.
[Via]
I have been interested in alien (invented) languages since my first brush with elven in the Lord of the Rings. I checked out The Klingon Dictionary from the library in high school and currently own a copy of it and The Languages of Middle Earth. During high school, I nerdily amused myself by attempting to develop a language for Antarians, which involved gutturals and whistles. Speaking it myself was nearly impossible and I would occasionally practice, trying to go from a growling sound to a whistle as quickly as my human apparatus would permit. I imagine the average passerby might have considered calling the police to have me committed, or at least checked for rabies.
New Scientist has a brief article about the possibility of actually preparing for what alien languages might be like. The argument that Terrence Deacon of UC Berkeley makes (according to the article) is that language serves a purpose. It is a communication system for describing the world and since the world is in some way a fixed point of reference (though perception of the world is not), then abstract symbolism is a feature common to all languages.
At one point, the study of xenolinguistics would have been a dream job for me. A nice office at NASA, a field that will probably never be verifiable. Could you ask for more?
At first, I found it annoying. But gradually, it grew on me. And now I like it. A friend said that I’d likely be sipping thorazine shakes at my local asylum tomorrow.
I was asked recently about the motivation for Abney’s DP (determiner phrase) hypothesis. That is, that determiners are not part of English noun phrases but head up their own phrases of which NPs are complements. I couldn’t remember the justification I was given in my Syntax I class, so I went back to the textbook (Syntax: A Generative Introduction by Andrew Carnie). I found the following interesting excerpt:
“… for lack of a better place to put them, we put determiners … in the specifiers of NPs. This, however, violates one of the basic principles underlying X-bar theory: All non-head material must be phrasal. Notice that this principle is a theoretical rather than an empirical requirement (i.e., it is motivated by the elegance of the theory and not by any data), but it is a nice idea from a mathematical point of view, and it would be good if we could show that it has some empirical basis.”
This clashes a bit with my empirical sensibilities. It represents very much the rationalist point of view in linguistics, that we can probe our own understanding of language by judging what we perceive to be grammatical or ungrammatical. The empiricist view would look at it from another angle: does it appear in data? So the theoretical view might be “nice” but if it is not supported by the data, it is crap.
Treebanks don’t use DPs (at least none that I’ve seen), so automatic parsers typically have no concept of them. I wonder if they would add any value? I’m guessing they would just run into sparsity issues since another set of tags has to be estimated. But who knows, the extra structure might be helpful in complex situations.
But if you must vote, vote for Nader.
According to Gartner, these will keep us busy for the next 25 years.
- Eliminate need to recharge batteries on wireless devices
- Improved parallel processing (at the PL and OS levels)
- Gesture detection
- Speech-to-speech machine translation
- Long term persistent storage
- 100-fold increase in programmer productivity
- Identifying the financial consequences of IT investment
I think numbers 1-3, 5, and 6 are almost certainly doable (though they all lie outside of my expertise). Number four will at least make very long strides towards being widespread and easy to use. I seriously doubt it will be perfect (and by perfect I mean as good as a trained translator). Number 7 I have no idea about, but getting management to understand the exact benefits of IT has been elusive for the past twenty-five years. I doubt IT managers even have that kind of understanding about it. There are so many variables. As humans are made to become more and more slaves to their corporate overlords (I mean protectors), perhaps productivity will become more predictable.
The lack of posts lately is due to the semester winding down. And by winding down, I mean grinding me up.
So in lieu of something substantive, here is another re-envisioning of Star Wars. I posted about Steampunk Star Wars a couple weeks ago, but this one is slightly more recent — taking its inspiration from World War II. Boba Fett and the Stormtrooper are my favorites. I can imagine a Death Star (not featured) that looks like a giant steel ball with steel panels bolted together. The weapon dish would be steel mesh. Many antennae would protrude from the poles.
A French-built supercomputer beat a 5 dan Go master in France a couple weeks ago. Go is a game I became very interested in in January 2007. I played several thousand games between then and a month ago, when I deleted my account on an online turn-based Go server. My reason for quitting was that it was taking too much time I should be using for studying, and I was letting it frustrate me too much. Go is a game that requires mental peace. You know how when you became a Jedi, you had to let go of your anger? The same helps for Go. I’ll take it back up again at some point, because it is a great mental exercise, but my obsession was just becoming too great.
The reason I picked up Go in the first place was that it remained outside the reach of computers. Of course, it was only a matter of time before it too fell. And actually, it hasn’t yet. Just because it beat a 5 dan French master, doesn’t mean it can beat a 9 dan master from China. So we’ll see.
The method this system used to beat said master was a Monte Carlo method. These are brilliantly simple in theory. You basically generate a multitude of random games for a set of moves and score each resulting game state. The next move with the best scoring set of random game states is chosen. This can also be thought of as voting. A set of random models each vote for a move. The most (or strongest) votes win. And when 10,000 monkeys agree…
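To make the voting idea concrete, here is a minimal sketch of flat Monte Carlo move selection. It is not the supercomputer program's code: tic-tac-toe stands in for Go so the example stays tiny, and every name in it (`monte_carlo_move`, `random_playout`, the board encoding) is made up for illustration. As far as I know, strong Go programs layer a tree search on top of the random playouts, which this sketch leaves out.

```python
import random

# Winning lines on a 3x3 board indexed 0..8 (tic-tac-toe stands in for Go).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == "."]


def random_playout(board, to_move, rng):
    """Play uniformly random moves to the end; return the winner (None = draw)."""
    board = list(board)
    while winner(board) is None and legal_moves(board):
        board[rng.choice(legal_moves(board))] = to_move
        to_move = "O" if to_move == "X" else "X"
    return winner(board)


def monte_carlo_move(board, player, playouts=200, rng=None):
    """Pick the move whose random playouts score best on average."""
    rng = rng or random.Random()
    opponent = "O" if player == "X" else "X"
    best_move, best_score = None, -1.0
    for move in legal_moves(board):
        trial = list(board)
        trial[move] = player
        score = 0.0
        for _ in range(playouts):
            result = random_playout(trial, opponent, rng)
            # Win = 1, draw = 0.5, loss = 0: each playout casts a "vote".
            score += 1.0 if result == player else 0.5 if result is None else 0.0
        score /= playouts
        if score > best_score:
            best_move, best_score = move, score
    return best_move


# X has two in a row on the top line; square 2 completes it.
print(monte_carlo_move("XX.OO....", "X", rng=random.Random(0)))
```

The monkeys here are the random playouts. No single one plays well, but averaged over a few hundred games they still separate the good moves from the bad ones.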
Most people have at least a passing familiarity with information trapping, if not the term itself. That is, most people who are early adopters of new technology, technogeeks, etc. In a nutshell, it is the practice of collecting information from the web as it happens. Subscribing to RSS feeds, setting up Google alerts, and using FreeAlert to find free stuff on Craigslist are all examples of information trapping. If entering a query in a search engine is fishing for information, using one of these (and many other services) is setting a trap for information.
I think this is an area that is going to be taking off in the next few years for people in various industries that are expected to keep up with the latest trends.

Bah, more spa — wait just a second! Cufflinks! Sweet!
The second workshop on large scale recommender systems will be at SIGKDD in Las Vegas this year. One of the main topics is the Netflix competition and Jim Bennett of Netflix is one of the co-chairs, so there should be some interesting stuff in that area. Plus all the other cool stuff going on with recommender systems (yes, there is other work being done on them!). Another of the co-chairs is Daniel Lemire, whose blog never disappoints.
The paper deadline for the workshop is May 30th.
I have no idea why, but four A-10 Warthogs made several circuits around the skies of Pittsburgh today. They are quite noisy, subsonic jets used by the military against armored vehicles and ground positions. The last time I had seen one outside of an air show or museum was when I was a kid camping in Sumter National Forest in South Carolina. A couple A-10’s from a local air base were doing some target practice. Their tank-busting guns sound like a giant dumpster slamming from far off. At first, we had no idea what the sound was coming from, so we joked it was the lizard man.
…if you don’t even come to the station?
Via Sarah Wood at Swivel: high school graduation rates for the largest school districts in the country are appallingly low. In Detroit, the lowest of the low, the graduation rate was a meager 24.9% for the 2003-04 year.
I don’t even have words to describe how bad this is, and Detroit is not alone.
Back in the so-called good old days, when American meals weren’t trendy blends of saffron and wild salmon, we ate meat and potatoes. And corn. And maybe cornbread if you lived in the South, or else this dry, flavorless, yellow cornish bread if you lived in the North. Corn was a common side dish on our plate when I was growing up and corn on the cob was a treat. Ever since this so-called biofuel star has been rising, I’ve been dreading the coming Apocalypse on corn prices. No more tasty side-dish! Now imagine if the cornerstone of your entire meal system was corn. Maybe you’d imagine yourself in Mexico, where tortillas are made of … well, corn.
A new friend I met online (and probable new student to the LTI) sent me a link to an article in Time, “The Clean Energy Scam.” The Amazon is being torn down to provide land for ranchers and crops, most of which are biofuels. Biofuels are supposed to be great because they are a renewable resource. Does anyone else see the problem here? It’s like killing babies to feed people. EPIC FAIL.
And it looks now like all the deforestation in the Amazon might be leading to a local climate shift. Instead of rain forest, it might become a savannah or desert (savannahs receive little rain, deserts less still). So these so-called green energies are leading to deforestation and potentially the destruction of the entire Amazon rain forest. Does that make global warming better or worse?
It is becoming more and more clear that anything with the label green is anything but green. They should be called brown. And the whole system rides on the back of oil: plastics in windmills and solar panels, transportation for everything that is made and moved anywhere. All of these brown products depend on oil to be made and usually at such a high cost that it takes literally decades for them to pay themselves off, far exceeding their own life expectancy. It’s not even clear that the oil cost of producing most of these things is exceeded by the carbon savings they deliver. It’s certainly not a requirement before attaching a green label to something.
Green is just another excuse for rampant consumerism. The only good thing here is that it shows people recognize there is a problem, but as usual, corporations have stepped in and deluded people en masse. The problem escalates, and we are going to pay. In the future, the typical American meal will consist of ashes.
I’ve been messing around with recommender systems for the past year and a half, but not using the kNN (k-Nearest Neighbors) algorithm. However, my current homework assignment for my Information Retrieval class is to implement kNN for a subset of the Netflix Prize data. The data we are working with is about 800k ratings, which is slightly smaller than the MovieLens dataset, the previous dataset of choice for research on movie recommender systems. The entire Netflix data set dwarfs the MovieLens set by a factor of 100, so it is quickly replacing MovieLens in papers. The Netflix data is also much sparser than MovieLens, which changes things as well.
kNN is a fairly simple machine learning algorithm to implement. On a dataset the size of Netflix, it’s still easy to do stupid things that cause it to take forever. Recommender systems typically match users to movies on a scale, which is the user’s rating for that item. In the case of Netflix, the scale is 1 (hate) to 5 (love). For the Netflix Prize, the goal is to guess users’ ratings on a hidden set as correctly as possible, as measured by root mean squared error (RMSE). One way of approaching the problem is to create a user-items matrix where the rows correspond to a user, the columns to an item (movie) and the value in each cell is the user’s rating for that item. If the user has not rated the item, it is assigned a zero. Now, we can split this matrix up into vectors, where each row vector represents a user. kNN seeks to find similar users to other users (or similar movies to other movies) according to some metric over these vectors. I won’t bother going into the metrics in detail, but they include cosine similarity, Euclidean distance, and Pearson correlation. The ratings from the chosen k users are combined (either by simply averaging or using some weighted average) to form the prediction for a movie the user has not yet rated.
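As a rough illustration of the user-based version described above, here is a minimal sketch using cosine similarity and a similarity-weighted average. It is not my homework code: the tiny ratings matrix and the function names are invented for the example, and a loop like this would be far too slow on the real Netflix data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two rating vectors (0 means 'not rated')."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(ratings, user, item, k=2):
    """Predict ratings[user][item] from the k most similar users who rated item."""
    # Candidate neighbors: every other user who has actually rated the item.
    neighbors = [
        (cosine(ratings[user], row), row[item])
        for idx, row in enumerate(ratings)
        if idx != user and row[item] != 0
    ]
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbors[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None  # no usable neighbors
    # Similarity-weighted average of the neighbors' ratings.
    return sum(sim * rating for sim, rating in top) / total


# Made-up user-items matrix: rows are users, columns are movies, 0 = unrated.
ratings = [
    [5, 4, 0, 1],  # user 0 has not rated movie 2
    [5, 5, 4, 1],
    [1, 2, 5, 5],
    [4, 4, 5, 2],
]

print(predict(ratings, user=0, item=2, k=2))
```

One wrinkle of the zero-for-unrated convention: unrated movies drop out of the dot product, but each user's norm still covers everything that user rated, so users with many ratings in common with you are favored over users with only a few.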
So on the test dataset for this assignment, I built three models that had the following RMSE scores:
| Model | RMSE |
| --- | --- |
| Model 1 | 0.9831 |
| Model 2 | 1.0371 |
| Model 3 | 0.9768 |
Just guessing the average rating for each movie gives an RMSE of 1.0 in this case, so Models 1 and 3 improve over the baseline, while Model 2 does worse. The best performing teams in the Netflix prize use ensemble methods to combine various models. The simple way to do this is just with a linear combination. So given models {m1, m2, m3} and weights {w1, w2, w3}, the ensemble prediction would be w1m1 + w2m2 + w3m3 (where w1+w2+w3=1.0). This was the first time I had tried ensembles with recommender systems as well, so imagine my surprise when I hit 0.9469 RMSE with my best choice of w’s. Of course, this is nowhere near the number needed to actually claim the prize, but it was a nice demonstration of the power of ensemble methods. I recommend checking out the proceedings of last year’s KDD Cup if you’re interested.
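For the curious, the linear combination can be sketched in a few lines. The predictions below are made-up toy numbers, not my actual model outputs, and the coarse grid search over weights is just one simple way to pick the w’s (I am not claiming it is what the top teams do). In practice you would also want to choose the weights on held-out data rather than on the set you report scores for.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )


def blend(models, weights):
    """Weighted sum of the models' predictions: w1*m1 + w2*m2 + w3*m3."""
    return [sum(w * p for w, p in zip(weights, preds)) for preds in zip(*models)]


def best_weights(models, actual, steps=10):
    """Grid search over three weights that sum to 1.0, in increments of 1/steps."""
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w = (i / steps, j / steps, (steps - i - j) / steps)
            score = rmse(blend(models, w), actual)
            if best is None or score < best[1]:
                best = (w, score)
    return best


# Toy data: true ratings and three hypothetical models' predictions.
truth = [4, 3, 5, 2, 1]
m1 = [3.5, 3.5, 4.0, 2.5, 2.0]
m2 = [4.5, 2.0, 5.0, 3.0, 1.0]
m3 = [4.0, 3.5, 4.5, 1.5, 1.5]

weights, score = best_weights([m1, m2, m3], truth)
print(weights, score)
```

Because the grid includes the corners (1, 0, 0), (0, 1, 0), and (0, 0, 1), the blended score on this data can never be worse than the best single model. The gain comes from the models making different mistakes.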
By now, the April Fools’ Day blog post shtick is so done that it has been reduced to one bad joke after another. It’s too much! I will be joining the Kloonigames Alliance Against Awfully Horrible Need-a-word-for-jokes-starting-with-N (KAAAHN).











