This commercial just caught my eye and made me think about faceted search.
7 Nov
31 Oct
13 Sep
There are quite a few well-known libraries for doing various NLP tasks in Java and Python, such as the Stanford Parser (Java) and the Natural Language Toolkit (Python). For Ruby, there are a few resources out there, but they are usually derivative or not as mature. By derivative, I mean they are ports from other languages or extensions using code from another language. And I’m responsible for two of them! :)
There are also a number of fledgling or orphaned projects out there purporting to be ports or interfaces for various other libraries like Stanford POS Tagger and Named Entity Recognizer. Ruby (straight Ruby, not just JRuby) can interface with just about any Java library using the Ruby Java Bridge (RJB). RJB can be a pain, and I could only initialize it once per run (a second attempt never succeeds), so there are some limitations. But using it, I was able to easily interface with the Stanford POS tagger.
So while there aren’t terribly many libraries for NLP tasks in Ruby, being able to interface with Java directly widens the scope quite a bit. You can also incorporate a C library using extensions.
Naturally, if I missed anything, no matter how small, please let me know.
16 Aug
This post contains NO spoilers.
I saw The Time Traveler’s Wife with my wife today. I had read the book about a year ago, and had been looking forward to the movie. I wasn’t disappointed — I thought the movie was very moving and captured the spirit of the book, even if it didn’t capture everything. It ignored some dynamics that the book elaborated on and some scenes and details were slightly different.
One thing I wondered while watching was how much of my enjoyment came from knowing all the background in the book, and how much came from the movie itself. If it was the former, then the movie wasn’t going to be that great an experience for someone who hadn’t read the book. If the latter, then it was a damn good movie. I don’t have the answer to that.
Another concern is how it’s a cultural norm in our society to bash movies based on books, and yet to relentlessly watch them to the point that Hollywood feels compelled to turn every book that sells a few copies into one. Douglas Adams once made the point that he changed the story of the Hitchhiker’s Guide to the Galaxy to match the medium he was writing it for. A story that plays well on the radio can take advantage of completely different things when it is translated to book or movie form. I don’t have the exact quote and searching for that kind of thing is damn near impossible on Google (let me know if you find it).
But that’s an observation I have long taken to heart when watching movies translated from books. Obviously you can’t fit an entire book into 2 hours and still have a story worth watching. You can’t capture the full power of every scene, every nuance, or every subtlety that a book can. That’s not what the silver screen does well. What it does well (when it is done right) is make you feel in touch with the characters and the story. Books do that too, but movies actually put the images before your eyes.
That said, I have never been able to bring myself to read a book based on a movie. I just can’t do it.
4 Aug
I already knew Wolfram|Alpha could do some cool astronomy calculations, like comparing the escape velocities of the Galilean moons. A recent W|A blog post also pointed out that you can calculate the next lunar eclipse. So I tried to see when the next solar eclipse would be for my area and it came up with a partial solar eclipse in 2014. Skip that and go to the next and it turns out there’s going to be a decent one in 2017. As a reminder, I sent an email to myself via FutureMe. It’ll be interesting to see if a) I’m still using Gmail in 8 years, b) FutureMe is still around sending emails, and c) we can still see the sun. Man, I love W|A.
1 Aug
I’m always a little annoyed that I have to implement sorting Map keys by their values myself in Java. It seems like that should be part of the standard Collections library or something. Maybe it is and I just haven’t seen it? My solution (gist) is based on feedback from Josh in the comments to a previous post. How does that look to you?
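For anyone who wants to see the idea without clicking through, here is a minimal sketch of one way to do it. It is not the code from the gist. It assumes Java 8 or later (for List.sort and Map.Entry.comparingByValue), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not the code from the linked gist.
final class MapKeySorter {

    // Returns the keys of `map` ordered by ascending value.
    // List.sort is stable, so keys with equal values keep the map's iteration order.
    static <K, V extends Comparable<? super V>> List<K> keysSortedByValue(Map<K, V> map) {
        List<Map.Entry<K, V>> entries = new ArrayList<>(map.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        List<K> keys = new ArrayList<>(entries.size());
        for (Map.Entry<K, V> entry : entries) {
            keys.add(entry.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("ruby", 3);
        counts.put("java", 1);
        counts.put("perl", 2);
        System.out.println(keysSortedByValue(counts)); // [java, perl, ruby]
    }
}
```

Copying the entries into a list and sorting that list leaves the original map untouched, which is usually what you want when the map is shared with other code.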
1 Aug
When Lazyfeed announced a limited round of beta invites on TechCrunch, I admit, I lusted after them. Only 250? I wanted to be one! But alas, I was put on the waiting list. It’s a decent marketing strategy for building up some hype. When I finally did get my invite, I tried it out for about 5 minutes and then fell prey to the distractions of the internet. That was a bad sign, though. Usually a new service can hold my attention for a little while longer. So what happened?
Lazyfeed is a service that lets you enter topics, blogs, and Twitter, Delicious and Flickr accounts to form a live streaming lazyfeed. You then get live updates in the form of your tags being updated. Your main screen consists of a bunch of boxes with your topics and then things it guesses are related.
Lazyfeed’s marketing strategy succeeded again by giving me three invites to hand out to friends. I offered them on Twitter, and only one person bit. So here are the other two invites for the adventurous. Get em while they’re hot. If you manage to take one, please comment that you did so, so that I can at least know who you were and we can save someone else the wasted time. I’m just throwing them into the ether like this because I don’t feel like pushing them on Twitter again.
NTI1MzMxMjc5ZVhmUTl5cDBiek1R
OTk5MTUwNjczN3JCLklmZHhjMDdV
Lazyfeed is a lovely service in terms of appearance and ajaxy goodness, but my initial impression is that it ends up being streaming information overload. For one, the topic suggestion feature appears to be fairly naive. Someone correct me if I’m wrong, but it looks a bit like document similarity for topics is done purely by one-for-one matching on tags. Whatever the method, the result of their suggested topics (“Stuff for Lazy Jason”) is stuff like the following:
Granted, it’s a hard problem, but those results are pretty bad. So as I started to write this post lambasting this service, I considered that maybe I was just seeing cold-start problems, and I was being unfair. So I trained it with some additional feeds and topics that are straight-to-the-point of stuff I’m interested in, like sigir2009, topicmodeling, recommendersystems, etc. Tags can contain no spaces, btw, which is why those don’t. When I tried using dashes, like I often do on delicious, it gives no results. I also removed some things that were too general or contained too many spurious results.
Things started improving here, and I actually began to understand what the point of Lazyfeed is. My initial confusion was that “Stuff for Lazy Jason” is stuff that I would want to read right now. Being lazy, I didn’t expect to have to do work to get those things. But “Stuff for Lazy Jason” is a list of topics it thinks I might be interested in. Saving any one of those puts it into my lazyfeed, which is in the bar on the left.
So now what happens is that occasionally it discovers something new related to my interests and it bumps that category to the top of the list and turns it bold again (grayed out topics have been read). Most of my topics are low traffic, so add something like mariahcarey if you want to see this functionality in action. Now we’re getting somewhere. It has actually started being helpful and has found me some stuff that my Google alerts haven’t. Which is weird, and is making me think I need to double check to make sure my Google alerts are working…
My takeaway after using Lazyfeed for nigh on two hours is that it’s an interesting alternative (or even extension) to RSS, but one that still hasn’t crossed the bridge to the next stage in evolution. The idea is solid. Automatically discover stuff in the sea of human knowledge (or human idiocy) and serve it up fresh. The implementation lacks robust topic detection which is unfortunately going to be necessary unless it is to become another source of information overload rather than a useful stream of relevant information. Relevance is an ephemeral thing, given that your information needs change from day to day. Lazyfeed makes it pretty easy to get rid of old topics and add new ones, even if some of their suggestions are still wonky. It’s an interesting recommender system problem with a lot of potential.
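To make the guess about one-for-one tag matching concrete, here is a small sketch of what that kind of naive suggestion could look like. This is purely an illustration of the guess, not Lazyfeed’s actual method, which isn’t public as far as this post knows. The class, the method names and the sample tags are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of exact-match tag overlap; not Lazyfeed's code.
final class TagOverlap {

    // Number of tags the two sets share, using exact string equality only.
    static int sharedTags(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Topics that share at least one tag with the user's tags, most shared tags first.
    static List<String> suggest(Set<String> userTags, Map<String, Set<String>> topicTags) {
        List<String> topics = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : topicTags.entrySet()) {
            if (sharedTags(userTags, entry.getValue()) > 0) {
                topics.add(entry.getKey());
            }
        }
        topics.sort((x, y) -> Integer.compare(
                sharedTags(userTags, topicTags.get(y)),
                sharedTags(userTags, topicTags.get(x))));
        return topics;
    }

    public static void main(String[] args) {
        Set<String> mine = new HashSet<>(Arrays.asList("nlp", "ruby", "topicmodeling"));
        Map<String, Set<String>> topics = new LinkedHashMap<>();
        topics.put("lda", new HashSet<>(Arrays.asList("topicmodeling", "nlp")));
        topics.put("rails", new HashSet<>(Arrays.asList("ruby", "web")));
        topics.put("lda-dashed", new HashSet<>(Arrays.asList("topic-modeling")));
        System.out.println(suggest(mine, topics)); // [lda, rails]
    }
}
```

With exact matching, "topic-modeling" and "topicmodeling" count as unrelated tags, and a single shared general tag is enough to pull in a topic. That is one plausible reason suggestions built this way can look wonky.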
30 Jul
GitHub just announced their own version of the Netflix Prize. Instead of predicting movie ratings, GitHub wants you to suggest repositories for users to watch. This is different from the Netflix Prize in a number of ways:
Already there have been many submissions. The number one place is currently held by Daniel Haran with 46.9% guessed correctly. Happy hunting, if you decide to compete.
The prizes are a bottle of Pappy Van Winkle bourbon and a large GitHub account for life. The bottle of Pappy is making me consider competing.
30 Jul
A while back I ported David Blei’s lda-c code for performing Latent Dirichlet Allocation to Ruby. Basically I just wrapped the C methods in a Ruby class, turned it into a gem, and called it a day. The result was a bit ugly and unwieldy, like most research code. A few months later, Todd Fisher came along and discovered a couple bugs and memory leaks in the C code, for which I am very grateful. I had been toying with the idea of improving the Ruby code, and embarked on a mission to do so. The result is a hopefully much cleaner gem that can be used right out of the box with little screwing around.
Unfortunately, I did something I’m ashamed of. Ruby gems are notorious for breaking backwards compatibility, and I have done just that. The good news is, your code will almost work, assuming you didn’t start diving into the Document and Corpus classes too heavily. If you did, then you will probably experience a lot of breakage. The result, I hope, is a more sensible implementation, so maybe you won’t hate me. Of course, I could be wrong and my implementation is still crap. If that’s the case, please let me know what needs to be improved.
To install the gem:
gem sources -a http://gems.github.com
sudo gem install ealdent-lda-ruby
Enjoy!
16 Jul
A Twitter friend (@communicating) tipped me off to the UEA-Lite Stemmer by Marie-Claire Jenkins and Dan J. Smith. Stemmers are NLP tools that get rid of inflectional and derivational affixes from words. In English, that usually means getting rid of the plural -s, progressive -ing, and preterite -ed. Depending on the type of stemmer, that might also mean getting rid of derivational suffixes like -ful and -ness. Sometimes it’s useful to be able to reduce words like consolation and console to the same root form: consol. But sometimes that doesn’t make sense. If you’re searching for video game consoles, you don’t want to find documents about consolation. In this case, you need a conservative stemmer.
The UEA-Lite Stemmer is a rule-based, conservative stemmer that handles regular words, proper nouns and acronyms. It was originally written in Perl, and has since been ported to Java. Since I usually code in Ruby these days, I thought it’d be nice to make it available to the Ruby community, so I ported it over last night.
The code is open source under the Apache 2 License and hosted on GitHub. So please check out the code and let me know what you think. Heck, you can even fork the project and make some improvements yourself if you want.
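Here is a toy sketch of the difference between the two behaviours. It is not the UEA-Lite rule set and not any real stemmer’s algorithm. The suffix lists, length thresholds and names are assumptions picked to keep the example short, and it makes no attempt to handle things like doubled consonants ("running" comes out as "runn").

```java
import java.util.Locale;

// Toy stemmer for illustration only; the rules here are invented, not UEA-Lite's.
final class ToyStemmer {

    private static final String[] DERIVATIONAL = {"ation", "ness", "ful"};

    // Conservative: strips only the inflectional suffixes -ing, -ed and plural -s.
    static String stemInflectional(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 5 && w.endsWith("ing")) {
            return w.substring(0, w.length() - 3);
        }
        if (w.length() > 4 && w.endsWith("ed")) {
            return w.substring(0, w.length() - 2);
        }
        boolean pluralLike = w.endsWith("s") && !w.endsWith("ss")
                && !w.endsWith("us") && !w.endsWith("is");
        if (w.length() > 3 && pluralLike) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Aggressive: also strips one derivational suffix and a trailing -e.
    static String stemAggressive(String word) {
        String w = stemInflectional(word);
        for (String suffix : DERIVATIONAL) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                w = w.substring(0, w.length() - suffix.length());
                break;
            }
        }
        if (w.length() > 3 && w.endsWith("e")) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"consoles", "consolation"}) {
            System.out.println(word + " -> conservative: " + stemInflectional(word)
                    + ", aggressive: " + stemAggressive(word));
        }
    }
}
```

The conservative version keeps "console" and "consolation" apart, while the aggressive one maps both to "consol", which is exactly the collision you don’t want when searching for video game consoles.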
One direction I’d like to be able to go is to turn all of the rules into finite state transducers, which can be composed into a single large deterministic finite state transducer. That would be a lot more efficient (and even fun!), but Ruby lacks a decent FST implementation.