Posts Tagged ‘ruby’

I just published the simple-random ruby gem, which is ported from C# code by John D. Cook.  You can view the source on github or install the gem via rubygems:

gem install simple-random

The gem allows you to sample from the following distributions:

  • Beta
  • Cauchy
  • Chi Square
  • Exponential
  • Gamma
  • Inverse Gamma
  • Laplace (double exponential)
  • Normal
  • Student t
  • Uniform
  • Weibull

Simple examples:

require 'rubygems'
require 'simple-random'

r = SimpleRandom.new
r.uniform # => 0.127064087195322
r.normal(5, 1) # => 5.71972152940515

Wordnik Gem

Posted: 12 March 2010 in Uncategorized
Tags: , , , , ,

Erin McKean

I’ve had my eye on Wordnik for a while, since finding out the excellent lexicographer Erin McKean co-founded it.  Wordnik is the most comprehensive dictionary in the known universe.  Srsly!

They released an API a few months ago and I quickly threw together a gem wrapping it, based on HTTParty.  Tonight I updated the gem for version 3 of the API and simplified it to just a single class with the bare essentials.  You can perform pretty much all of the API calls and get a hash of the results.  It’s nothing major, but will give you a chance to play around with the Wordnik API with almost no work on your part (aside from getting yourself a key).  This change breaks backwards compatibility completely, sorry.

Example usage:

w = Wordnik.new("YOUR_API_KEY")
w.define('gem') # => big hash with all the definitions
w.examples('gem') # => example sentences using "gem"

You can grab the gem off of RubyGems or you can take a look at the source on github.  As always, please let me know if you encounter any problems.

TunkRank Improvements

Posted: 17 February 2010 in Uncategorized
Tags: , , , , , , , , ,

Over the past few weeks, I’ve been working on a number of improvements to TunkRank that I will be rolling out tonight. First, I’ve secured a server to host it on, rather than my old Dell laptop, so reliability should improve and TunkRank is no longer a slave to dynamic DNS problems. Also, my cable company is less likely to hunt me down. TunkRank has gotten some increased attention over the past few weeks, including from Chris Dixon, CEO of the wonderful website hunch:

Twitter could fix the whole follower obsession by highlighting a more meaningful metric like TunkRank.

Awesome! So with this new version, there are a few changes that will immediately impact you, the end-user. I’ll go into the ones that affect you the most first, followed by some technical points of interest for those who care. Then I’ll conclude with a couple of hints at the future.

Changes to TunkRank

First and foremost, I have changed the main score that is reported. Previously I was using a percentile in the range (1-100). This got a lot of objections and created confusion. Partially because I consider the 100th percentile to be the “top-tier” of users, while standardized testing often reports the 99th percentile to mean you performed better than 99% of the population. Also, most people who actually care about their scores enough to use TunkRank are in the 95-100 percentile range, making more fine-grained comparisons difficult. Neal Richter even posted on his blog some suggestions for improving it (quite a while ago, now).

I took a page out of Neal’s book with the log scores, but I also put it in a range where the most influential twitter user (let’s call her MAX) will always have a score of 100. Your TunkRank Score™ is the ratio of the log of your raw score to the log of MAX’s score. So formulas aside, this means your TunkRank score is directly comparable to other users and is always in perspective of the maximum influence exerted by any user in the Twitterverse. Incidentally, comparing users with a difference of seven TunkRank score points means the user with the higher score is about twice as influential.

Accessing the API has also changed slightly, and I apologize to anyone actually using it at the moment. Basically, I am matching the API calls to more closely conform to the URLs used on the web side, and I’m returning more information with each call. TunkRank also supports XML responses in addition to JSON. You can find all of the documentation here.

Some Technical Notes

As part of the move, I’ve decided to transition from using Merb to Rails. My original decision to use Merb was partially as a learning exercise, but also because Merb appealed to me with its being lightweight. However, I often ran into roadblocks because some useful plugin wasn’t supported (or I couldn’t figure out how to make it work in the limited time I had). Sometimes the documentation for Merb was very good and sometimes it was absent altogether. Rails, on the other hand, has a substantial amount of documentation and people are always blogging about the best way to do things — which makes life as a developer much easier. Rails is my day job, so I knew I could transition quickly and easily.

I also migrated from MySQL to PostgreSQL. The main reason is that I love PostgreSQL — plain and simple. They both have their advantages, but MySQL gives me a sense of uneasiness I don’t have with PostgreSQL. I’ve managed to achieve some nice speed improvements as part of the redesign, though that is not to say that the same speed improvements wouldn’t have been possible with MySQL.

I’ve also adopted Resque as my background job-processing library. It is backed by Redis, an advanced key-value store that you can think of as a “data structures server.” The important thing for me is that Resque is fast, has a kick-ass web interface, and integrating with Rails is brain-dead easy.

The Road Ahead

I wrote before about the road ahead for TunkRank, and I have mostly held to it. I have many more ideas I want to expand on, including topic-sensitive influence rankings. I like the ideas in the recent WSDM paper (pdf) by Weng et al, but I have a few new ideas I’m eager to try out. TunkRank scores may also be integrated into Tickery in the near future, thanks to some discussions with Terry Jones of FluidDB.  I’m excited!

There are quite a few well-known libraries for doing various NLP tasks in Java and Python, such as the Stanford Parser (Java) and the Natural Language Toolkit (Python).  For Ruby, there are a few resources out there, but they are usually derivative or not as mature.  By derivative, I mean they are ports from other languages or extensions using code from another language.  And I’m responsible for two of them! :)

  • Treat – Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far for Ruby
    • Text extractors for various document formats
    • Chunkers, segmenters, tokenizers
    • LDA
    • much more – the list is big
  • Ruby Linguistics – this is one of the more ambitious projects, but is not as mature as NLTK
    • interface for WordNet
    • Link grammar parser
    • some inflection stuff
  • Stanford Core NLP – if you’ve gotten a headache trying to use the Java bridge, this is your answer
  • Stanford Parser interface – uses a Java bridge to access the Stanford Parser library
  • Mark Watson has a part of speech tagger [zip], a text categorizer [zip], and some text extraction utilities [zip], but I haven’t tried to use them yet
  • LDA Ruby Gem- Ruby port of David Blei’s lda-c library by yours truly
    • Uses Blei’s c-code for the actual LDA but I include some wrappers to make using it a bit easier
  • UEA Stemmer – Ruby port (again by yours truly) of a conservative stemmer based on Jenkins and Smith’s UEA Stemmer
  • Stemmer gemPorter stemmer
  • Lingua Stemmer - another stemming library, Porter stemmer
  • Ruby WordNet - basically what’s included in Ruby Linguistics
  • Raspell – Ruby interface to Aspell spell checker

There are also a number of fledgling or orphaned projects out there purporting to be ports or interfaces for various other libraries like Stanford POS Tagger and Named Entity Recognizer.  Ruby (straight Ruby, not just JRuby) can interface just about any Java library using the Ruby Java Bridge (RJB).  RJB can be a pain, and I could only initialize it once per run (a second attempt never succeeds), so there are some limitations.  But using it, I was able to easily interface with the Stanford POS tagger.

So while there aren’t terribly many libraries for NLP tasks in Ruby, the availability of interfacing with Java directly widens the scope quite a bit.  You can also incorporate a c library using extensions.

Naturally, if I missed anything, no matter how small, please let me know.

Update: Here is a great list of AI-related ruby libraries from Dustin Smith.

works-on-my-machine-starburstA while back I ported David Blei’s lda-c code for performing Latent Dirichlet Allocation to Ruby.  Basically I just wrapped the C methods in a Ruby class, turned it into a gem, and called it a day.  The result was a bit ugly and unwieldy, like most research code.  A few months later, Todd Fisher came along and discovered a couple bugs and memory leaks in the C code, for which I am very grateful.  I had been toying with the idea of improving the Ruby code, and embarked on a mission to do so.  The result is a hopefully much cleaner gem that can be used right out of the box with little screwing around.

Unfortunately, I did something I’m ashamed of.  Ruby gems are notorious for breaking backwards compatibility, and I have done just that.  The good news is, your code will almost work, assuming you didn’t start diving into the Document and Corpus classes too heavily.  If you did, then you will probably experience a lot of breakage.  The result, I hope is a more sensical implementation, however, so maybe you won’t hate me.  Of course, I could be wrong and my implementation is still crap.  If that’s the case, please let me know what needs to be improved.

To install the gem:

gem sources -a http://gems.github.com
sudo gem install ealdent-lda-ruby

Enjoy!

Reblog this post [with Zemanta]

A twitter friend (@communicating) tipped me off to the UEA-Lite Stemmer by Marie-Claire Jenkins and Dan J. Smith.  Stemmers are NLP tools that get rid of inflectional and derivational affixes from words.  In English, that usually means getting rid of the plural -s, progressive -ing, and preterite -ed.  Depending on the type of stemmer, that might also mean getting rid of derivational suffixes like -ful and -ness.  Sometimes it’s useful to be able to reduce words like consolation and console to the same root form: consol.  But sometimes that doesn’t make sense.  If you’re searching for video game consoles, you don’t want to find documents about consolation.  In this case, you need a conservative stemmer.

The UEA-Lite Stemmer is a rule-based, conservative stemmer that handles regular words, proper nouns and acronyms.  It was originally written in Perl, but had been ported to Java.  Since I usually code in Ruby these days, I thought it’d be nice to make it available to the Ruby community, so I ported it over last night.

The code is open source under the Apache 2 License and hosted on github.  So please check out the code and let me know what you think.  Heck, you can even fork the project and make some improvements yourself if you want.

One direction I’d like to be able to go is to turn all of the rules into finite state transducers, which can be composed into a single large deterministic finite state transducer.  That would be a lot more efficient (and even fun!), but Ruby lacks a decent FST implementation.

Reblog this post [with Zemanta]

Learning Scala

Posted: 11 July 2009 in Uncategorized
Tags: , , , , ,
Programming in Scala

Programming in Scala

Two weeks ago, I picked up my copy of Programming in Scala, which had been languishing on my shelf for months.  I pre-purchased it since I went to high school with one of the authors (Lex Spoon).  His mother, incidentally, was also my favorite math teacher.  When I started my new job back in September 2008, I was a total noob at Ruby, so learning that consumed my attention and other languages took a back burner.  Also, I’m always a little reluctant when it comes to learning new languages.  Not because I don’t like to learn them or because it’s difficult — but because it’s a serious investment of time that may be totally wasted.  Sure, Standard ML is an interesting language, but try finding a job doing it.  When I heard that Twitter was using Scala, I figured the time has come to pick up this book.  It also helped that a friend recently started an Atlanta Scala Meetup group.

Aside from being an update on my life, the point of this post is to say that this book is great.  Seldom have I encountered a programming book that achieves this level of depth while still being fun to read.  There are great examples with humor mixed in, the writing is clear and concise, and it’s thorough. What more could you want?

Has anyone else picked up Scala?  (I know there’s a few of you out there lurking!)  Are there any other good books you would recommend?

In the interest of full disclosure, though I know one of the authors, I haven’t actually talked to him in quite a long time (since high school, I think).  I also don’t make any extra money aside from the Amazon affiliate program commission if you happen to buy anything on their site after clicking the book link.

Reblog this post [with Zemanta]

tunkrank-ravenA couple months ago, Daniel Tunkelang posted an algorithm on his blog that attempts to emulate PageRank for Twitter.  I implemented a toy version I dubbed TunkRank, and then suggested that name on his blog.  It got some traction, so I figured what the heck and decided to implement it on TunkRank.com.

Now, there appeared to be a little debate about just whether it is actually emulating PageRank or something else on Daniel’s blog, but I leave it to you to read the comments  on his post if you’re interested. There are also plenty of ideas there on the best way to establish a measure of influence.  I’ll limit the discussion in this post to the basics.

  1. The amount of attention you can give is spread out among all those you follow. The more you follow, the less attention you can give each one.
  2. Your influence depends on the amount of attention your followers can give you.

As a twitterer, your influence does not depend on how many people you follow. However, your usefulness as a follower does. Having higher influence depends on having many followers who follow relatively few people but are followed by many. Followers like that are more likely to pick up on your tweets, act on them, retweet them, whatever. You gain influence through the social graph thanks to their influence.

Therefore, your TunkRank score is a reflection of how much attention your followers can both directly give you and give to you.

I implemented this algorithm in Ruby using Merb, MySQL, Capistrano, nginx, and ActiveRecord (and, of course, Git for version control). While my job involves working on a web app, my role has mostly been on back-end NLP stuff. I’m still quite new to the whole Rails-level-web-app-world. For those who don’t know, Merb is a framework similar Ruby on Rails. So similar they are merging and will become Rails 3. ActiveRecord is an Object-relational Mapping (ORM) that Rails uses. The standard ORM for Merb is DataMapper, but I stuck with something I’m more familiar with to limit the variables in my little project.

There are many aspects of getting a web app up and running that I had only heard about in passing — and many more I’m still lost on. But I figured implementing TunkRank would be an interesting place to start.

Phase I – Data Collection

As I said, I implemented TunkRank as a toy the same night that Daniel posted his algorithm. Things seemed to work out quite nicely and I liked it on theoretical grounds as a measure. When I decided to implement the real version, the task of hammering Twitter millions of times suddenly loomed. I suppose I thought there were maybe about 1 million active accounts on Twitter. I have harvested over 2 million before slowing my harvesting down in favor of other development. I have also collected about 40 million edges in the social graph (user A follows user B is one edge). Of the 2 million users I have encountered, those 40 million edges are for only 25% of them. I still haven’t gotten the followers for the remaining 1.5 million. When I do so, I’m sure I’ll discover another million or three users I haven’t seen yet.

I stopped where I did because I was using Ruby’s marshal functionality to dump the social graph to disk. Each dump was weighing in around 250 MB and it was exceeding Marshal’s ability to function. At this point I threw everything into a MySQL database. Bleh! I can’t even describe the pain in the ass that was. If I were to do that again, I would certainly use PostgreSQL, and may still do so. Better yet, I would use some sort of column store database.  But it’s in the MySQL db now and running ok (just ok, not great or even well). MySQL dies quietly and annoyingly at times.  I hate it.

Doing the operations I was doing before in memory in ActiveRecord instead is mind-bogglingly slow by comparison, as you’d expect. Twitter just released the ability to pull all follower ids in one request, which would have made my life easier, but I still can benefit from it going forward. Also, I should have been storing more information about users than just the twitter username. Having to go back and collect that was slow and annoying, but it’s done.

Phase II – Implementing the Algorithm

The algorithm is simple to compute. Check out this gist for a version that calculates it using ActiveRecord. I’d post it here, but WordPress.com sucks and I’m stuck with it. The code uses ActiveRecord more than I’d like, so I rewrote it in SQL using twitter ids.  The gist for that is here.  The #{p} and #{self.twitter_id} are Ruby variables.

Phase III – Doing the Web App

The web app itself is both the most important step and the least fun for me. I very much enjoyed putting together the code to collect the Twitter social graph and then computing the TunkRank scores, but all the nuts and bolts of getting a web app up and running are tedious. Some of it is interesting. Merb isn’t so bad, though I feel like the documentation is shitty. There is an open source Merb book that is missing stuff in all the sections I needed the most. The API documentation isn’t bad, but isn’t easy to search for high level things that you would normally find in a tutorial. Nor should it be — it’s API documentation not a tutorial.

Fortunately, most things were easy enough that I could find a solution eventually. The whole deploying step is foreign to me, and I’m an apache noob so when it comes to balancing mongrel instances I’m like wtf?  Fortunately, I found a few tutorials I was able to piece together.

So the final product is hosted on my 1.8 GHz dual core Dell laptop with 2 GB RAM running Ubuntu 8.10. If you check it out, hopefully it won’t overtax my pathetic server and bring the site down. My data is becoming a little stale so if your username isn’t found, please be patient. When a new person is encountered, I queue them for processing.

Final Thoughts

You can also follow @tunkrank on Twitter. I originally had that account acting as a bot that tweets scores when it encounters influential users. Also,  I was having it auto-follow anyone it grades, but upon reflection, it occurred to me these two things were just plain spammy. I chalk it up to a bad decision in the dead of night. Instead I will just have it follow anyone who follows it.  See my twitter philosophy for how the account will be managed.  I will post updates there on changes, fixes, and up/downtime.

The TunkRank score itself can grow quite large, especially for users with a high number of followers. I present percentiles as the measure, so everything falls in the interval [0,100]. That does not properly reflect that someone in the 100th percentile can be almost 1000 times more influential than someone in the 99th. I’m open to suggestions about how better to show this information. Neal Richter had a few good ideas, perhaps I’ll try one of those.  Still, though, I’m left feeling a little dissatisfied by all of the scoring mechanisms (my own included). As Neal pointed out, his ideas are starting points and I’d like to hear what other people would like to see before proceeding with a different scoring method.

Let me know what you think.

Fun with trees in Ruby

Posted: 20 November 2008 in Uncategorized
Tags: , , , , , , , ,

Like Java and unlike Python, Ruby does not support multiple inheritance.  Also there is no explicit way to create an interface.  One way Ruby lets you get around both problems is by allowing you to include a module in a class.  It’s not quite the same, but with the proper planning you can duplicate the functionality.  Of course, one question you should always ask yourself when trying to shoehorn something from one language into another is if you’re really going about it the right way.

One way of implementing a Java-like interface in Ruby is by creating a module containing the skeleton functions you want the implementing class to implement.

module A
  def method1() raise "not supported"; end
end

class B
  include A
  def method1
    puts "now implemented"
  end
end

Presto, module A is basically a Java interface.  Of course, whether a method has been implemented is not checked until runtime when the method is actually called.  Also if you mix in implemented methods alongside the interface methods, you have something very like an abstract class (minus the compile-time checking).

This came up when I was implementing a bunch of simple tree functions like finding the siblings of a node, finding the grandparent, the descendants, the leaves of a subtree, etc.  It seemed like these were things that should be implemented already.  And why not?  So I threw all of those methods into a module and made it like a Java abstract class.  All you have to implement is a method to call the parent of the current node (or return nil if there is none) and a method to get an Array of the children of the current node.  Your class can pull children from a database, a file, something more complex — it doesn’t matter.  Just implement those two methods and drop in the SimpleTree module and problem solved.

Since I’ve been having fun with gems, I made one for this and slapped it up on GitHub.  To get it, just type:

sudo gem install ealdent-simple-tree

Assuming that you have already done this as some point in the past:

gem sources -a http://gems.github.com

Feel free to extend it, modify it, contribute to it, etc. I’m using the BSD license, which is my current favorite.

Since Ruby is my new favorite toy, I thought it would be fun to try my hand at C extensions.  I came across David Blei’s C code for Latent Dirichlet Allocation and it looked simple enough to convert into a Ruby module.  Ruby makes it very easy to wrap some C functions (which is good to know if you need a really fast implementation of something that gets called alot).  Wrapping a C library is slightly harder, but not horribly so.  Probably most of my challenge was the fact that it’s been so long since I wrote anything in C.

Since the code is open source, I decided to release the Ruby wrapper as a gem on GitHub.  I chose GitHub over RubyForge, because it uses Git and freakin’ rocks.  But GitHub is a story for another day.  Feel free to contribute and extend the project if you’re so inclined.

A basic usage example:

require 'lda'
# create an Lda object for training
lda = Lda::Lda.new
corpus = Lda::Corpus.new("data/data_file.dat")
lda.corpus = corpus
# run EM algorithm using random starting points
lda.em("random")
lda.load_vocabulary("data/vocab.txt")
# print the topic 20 words per topic
lda.print_topics(20)

You can also download the gem from GitHub directly:

gem sources -a http://gems.github.com
sudo gem install ealdent-lda-ruby

You only need the first line if you haven’t added GitHub to your sources before.