Posts Tagged ‘algorithms’

Java maps and sorting

Posted: 1 August 2009 in Uncategorized
Tags: , , , , ,

I’m always a little annoyed I have to implement sorting Map keys by their values myself in Java.  It seems like they should be a part of the standard Collections library or something.  Maybe they are and I just haven’t seen it?  My solution (gist) is based on feedback from Josh in the comments to a previous post. How does that look to you?

tunkrank-ravenA couple months ago, Daniel Tunkelang posted an algorithm on his blog that attempts to emulate PageRank for Twitter.  I implemented a toy version I dubbed TunkRank, and then suggested that name on his blog.  It got some traction, so I figured what the heck and decided to implement it on TunkRank.com.

Now, there appeared to be a little debate about just whether it is actually emulating PageRank or something else on Daniel’s blog, but I leave it to you to read the comments  on his post if you’re interested. There are also plenty of ideas there on the best way to establish a measure of influence.  I’ll limit the discussion in this post to the basics.

  1. The amount of attention you can give is spread out among all those you follow. The more you follow, the less attention you can give each one.
  2. Your influence depends on the amount of attention your followers can give you.

As a twitterer, your influence does not depend on how many people you follow. However, your usefulness as a follower does. Having higher influence depends on having many followers who follow relatively few people but are followed by many. Followers like that are more likely to pick up on your tweets, act on them, retweet them, whatever. You gain influence through the social graph thanks to their influence.

Therefore, your TunkRank score is a reflection of how much attention your followers can both directly give you and give to you.

I implemented this algorithm in Ruby using Merb, MySQL, Capistrano, nginx, and ActiveRecord (and, of course, Git for version control). While my job involves working on a web app, my role has mostly been on back-end NLP stuff. I’m still quite new to the whole Rails-level-web-app-world. For those who don’t know, Merb is a framework similar Ruby on Rails. So similar they are merging and will become Rails 3. ActiveRecord is an Object-relational Mapping (ORM) that Rails uses. The standard ORM for Merb is DataMapper, but I stuck with something I’m more familiar with to limit the variables in my little project.

There are many aspects of getting a web app up and running that I had only heard about in passing — and many more I’m still lost on. But I figured implementing TunkRank would be an interesting place to start.

Phase I – Data Collection

As I said, I implemented TunkRank as a toy the same night that Daniel posted his algorithm. Things seemed to work out quite nicely and I liked it on theoretical grounds as a measure. When I decided to implement the real version, the task of hammering Twitter millions of times suddenly loomed. I suppose I thought there were maybe about 1 million active accounts on Twitter. I have harvested over 2 million before slowing my harvesting down in favor of other development. I have also collected about 40 million edges in the social graph (user A follows user B is one edge). Of the 2 million users I have encountered, those 40 million edges are for only 25% of them. I still haven’t gotten the followers for the remaining 1.5 million. When I do so, I’m sure I’ll discover another million or three users I haven’t seen yet.

I stopped where I did because I was using Ruby’s marshal functionality to dump the social graph to disk. Each dump was weighing in around 250 MB and it was exceeding Marshal’s ability to function. At this point I threw everything into a MySQL database. Bleh! I can’t even describe the pain in the ass that was. If I were to do that again, I would certainly use PostgreSQL, and may still do so. Better yet, I would use some sort of column store database.  But it’s in the MySQL db now and running ok (just ok, not great or even well). MySQL dies quietly and annoyingly at times.  I hate it.

Doing the operations I was doing before in memory in ActiveRecord instead is mind-bogglingly slow by comparison, as you’d expect. Twitter just released the ability to pull all follower ids in one request, which would have made my life easier, but I still can benefit from it going forward. Also, I should have been storing more information about users than just the twitter username. Having to go back and collect that was slow and annoying, but it’s done.

Phase II – Implementing the Algorithm

The algorithm is simple to compute. Check out this gist for a version that calculates it using ActiveRecord. I’d post it here, but WordPress.com sucks and I’m stuck with it. The code uses ActiveRecord more than I’d like, so I rewrote it in SQL using twitter ids.  The gist for that is here.  The #{p} and #{self.twitter_id} are Ruby variables.

Phase III – Doing the Web App

The web app itself is both the most important step and the least fun for me. I very much enjoyed putting together the code to collect the Twitter social graph and then computing the TunkRank scores, but all the nuts and bolts of getting a web app up and running are tedious. Some of it is interesting. Merb isn’t so bad, though I feel like the documentation is shitty. There is an open source Merb book that is missing stuff in all the sections I needed the most. The API documentation isn’t bad, but isn’t easy to search for high level things that you would normally find in a tutorial. Nor should it be — it’s API documentation not a tutorial.

Fortunately, most things were easy enough that I could find a solution eventually. The whole deploying step is foreign to me, and I’m an apache noob so when it comes to balancing mongrel instances I’m like wtf?  Fortunately, I found a few tutorials I was able to piece together.

So the final product is hosted on my 1.8 GHz dual core Dell laptop with 2 GB RAM running Ubuntu 8.10. If you check it out, hopefully it won’t overtax my pathetic server and bring the site down. My data is becoming a little stale so if your username isn’t found, please be patient. When a new person is encountered, I queue them for processing.

Final Thoughts

You can also follow @tunkrank on Twitter. I originally had that account acting as a bot that tweets scores when it encounters influential users. Also,  I was having it auto-follow anyone it grades, but upon reflection, it occurred to me these two things were just plain spammy. I chalk it up to a bad decision in the dead of night. Instead I will just have it follow anyone who follows it.  See my twitter philosophy for how the account will be managed.  I will post updates there on changes, fixes, and up/downtime.

The TunkRank score itself can grow quite large, especially for users with a high number of followers. I present percentiles as the measure, so everything falls in the interval [0,100]. That does not properly reflect that someone in the 100th percentile can be almost 1000 times more influential than someone in the 99th. I’m open to suggestions about how better to show this information. Neal Richter had a few good ideas, perhaps I’ll try one of those.  Still, though, I’m left feeling a little dissatisfied by all of the scoring mechanisms (my own included). As Neal pointed out, his ideas are starting points and I’d like to hear what other people would like to see before proceeding with a different scoring method.

Let me know what you think.

No this isn’t a list of the top 25 memes of all time. It’s a post inspired by the “Top 25″ memes that have been going around Facebook, MySpace, the intarwebs, etc. These memes are fairly simple, as many are. List your top 25 favorite albums in your collection. List your top 25 books. List your top 25 experiences with acne. The lists go on. But, how would you go about producing such a list? Do you just have the top 25 albums in your collection ranked and ready in your mind? Or do you have a crapload of albums and figuring out the top 25 would be a non-trivial task?

A friend approached me with this very problem and he brought up using binary comparisons. I have talked about binary comparisons when dealing with machine translation evaluation before, and many in the academic community have begun to appreciate their value for eliciting human preferences. Performing evaluations using binary comparisons is straightforward: give the evaluator pairs of items and ask which is better according to some subjective metric. In the case of machine translation evaluation, the task is to decide which translation is more human-like.

It ought to be obvious that a major problem with binary comparisons in a data set of any appreciable size is the massive number of comparisons you will need in order to rank every item in the set. For the sake of argument, let’s assume that your comparisons never contradict themselves. Under our assumption, the following does not occur: A is rated better than B, B is rated better than C, and C is rated better than A. This lets us bound the amount of comparisons to O(n log n), since we are basically dealing with a sorting algorithm using the human as the comparison operator. This was my friend’s idea and I really liked it.

But let’s say you have 1000 albums. 1000 * log (1000) = 9966 comparisons! That’s just too much work. Maybe we ought to just rate them on a scale of 1-100 (or whatever) and order them using that. It’s not perfect, but it’s ten times less work. But what if you were dedicated and actually wanted a good ranking?

Binary comparisons lend themselves to binary trees and a binary search tree would be ideal, so heapsort seems like a natural sorting algorithm. But we’re still stuck with O(n log n) to build it. Returning to the idea of “top 25″ memes, we don’t need the ordering of the entire set, just the top 25 items, so we can surely get rid of some of the work.

So rather than doing a full heapsort algorithm, I figured why not use an AVL tree. We don’t need to keep the entire tree balanced — we only need to worry about the right branch off the root which can be limited to 25 nodes. The algorithm is fairly simple (pseudocode):

  • for each item:
  • compare item to root node:
  • if item < root then
    • discard
  • else
    • add to right branch
    • balance the right branch
    • if the number in the right branch exceeds 25, replace the root node with the lowest value and delete that node.

This greatly reduces the number of comparisons needed since you very quickly bubble the best items into the top 25 branch. When a new item is being added to the top 25, only 4-5 comparisons need be made to determine its place. By continuously balancing the right branch, we can keep the height of that side as low as possible on average in order to prevent too many comparisons when new items are added to it.

I couldn’t figure out an easy way to compute the big O for this, so I ran some simulations instead. For a set of 1000 items, on average the user will have to perform 1528 +/- 2 comparisons to produce the top 25 items.

The code for the tree can be found in this gist. The nasty $counter global is just there to help me count the number of comparisons that are being performed for the purposes of the simulation. Some of the AVL Tree code was inspired by Bruno Preiss’s book on algorithms and data structures.

Fun with trees in Ruby

Posted: 20 November 2008 in Uncategorized
Tags: , , , , , , , ,

Like Java and unlike Python, Ruby does not support multiple inheritance.  Also there is no explicit way to create an interface.  One way Ruby lets you get around both problems is by allowing you to include a module in a class.  It’s not quite the same, but with the proper planning you can duplicate the functionality.  Of course, one question you should always ask yourself when trying to shoehorn something from one language into another is if you’re really going about it the right way.

One way of implementing a Java-like interface in Ruby is by creating a module containing the skeleton functions you want the implementing class to implement.

module A
  def method1() raise "not supported"; end
end

class B
  include A
  def method1
    puts "now implemented"
  end
end

Presto, module A is basically a Java interface.  Of course, whether a method has been implemented is not checked until runtime when the method is actually called.  Also if you mix in implemented methods alongside the interface methods, you have something very like an abstract class (minus the compile-time checking).

This came up when I was implementing a bunch of simple tree functions like finding the siblings of a node, finding the grandparent, the descendants, the leaves of a subtree, etc.  It seemed like these were things that should be implemented already.  And why not?  So I threw all of those methods into a module and made it like a Java abstract class.  All you have to implement is a method to call the parent of the current node (or return nil if there is none) and a method to get an Array of the children of the current node.  Your class can pull children from a database, a file, something more complex — it doesn’t matter.  Just implement those two methods and drop in the SimpleTree module and problem solved.

Since I’ve been having fun with gems, I made one for this and slapped it up on GitHub.  To get it, just type:

sudo gem install ealdent-simple-tree

Assuming that you have already done this as some point in the past:

gem sources -a http://gems.github.com

Feel free to extend it, modify it, contribute to it, etc. I’m using the BSD license, which is my current favorite.

In previous posts on cognate identification, I discussed the difference between strict and loose cognates. Loose cognates are words in two languages that have the same or similar written forms. I also described how approaches to cognate identification tend to differ based on whether the data being used is plain text or phonetic transcriptions. The type of data informs the methods. With plain text data, it is difficult to extract phonological information about the language so approaches in the past have largely been about string matching. I will discuss some of the approaches that have been taken below the jump.  In my next posting, when I get around to it, I will begin looking at some of the phonetic methods that have been applied to the task. (more…)

Morning Madness

Posted: 2 November 2007 in Uncategorized
Tags: , , , , , , , , , ,

Ever get pissed twice before you’ve really even opened your eyes? This is why I shouldn’t read my RSS feeds so early in the morning. At the top of the list is Bush equating Democrats who oppose the war (as if it could be called opposition, anyway) to those who ignored Hitler and Lenin and then Hillary firing back. Am I mad at Bush for making this analogy? No and I think he’s correct, but not in the way he thinks. I’m more angry at Hillary for firing back and not recognizing her own culpability. The Sheepocrats sat back and did nothing four years ago when this war began and passed the Patriot Act before that. They have endorsed the war at every stage since and even their current so-called opposition is luke-warm and putrid with its weasliness. So yeah, they are like people who ignored the rise of Hitler and Lenin. If she had recognized that and said it publicly, it would have done her credit.

 

Next up, I was reading a few bit twiddling hacks and came across a nice one for branchless absolute value [hat tip]. The hacks are all in the public domain, too, so that’s good. He does list the occasional variation that is patented, an enormously helpful fact if you’re producing commercial software. So here is the patented version of the branchless absolute value:


int v; // we want to find the absolute value of v
int r; // the result goes here
int const mask = v >> sizeof(int) * CHAR_BIT - 1;
r = (v ^ mask) - mask;

The last ^ (XOR) – (subtract) combination represents the patent. What works also?

r = (v + mask) ^ mask;

As Sean points out, though, the patent probably could be contested if the holder (none other than Sun Microsystems) ever tried to enforce it. So what ticked me off is that such a thing could be patented. I raise my hands in impotent fury at the ludicrousness of software patents. I don’t blame the inventors for them, it’s something you pretty much have to do these days. I blame the system that makes that true.

Update

Did some benchmarks on the two versions of absolute value given above.  Using a 3.06GHz processor, I could run 4 billion absolute values in 18.916 +/- 0.021 seconds for the patented version and 18.906 +/- 0.026 seconds for the free version.  So no need to even bother with the patented version it looks like.

Merge sort fun

Posted: 18 September 2007 in Uncategorized
Tags: , , , , , ,

Suppose you have an array of floating point numbers with each index into the array being an id number corresponding to some external data structure. You want to sort this array, but in doing so you would destroy the references to the id numbers, since the indexes of the array would no longer correspond to the correct id numbers in the external data structure. For example, let’s say we are dealing with customers of a store and each value in the array is their current balance.

balances = {25.61, 13.45, 89.75, 21.2, 96.50}

Each index in the balances array corresponds to an index in some other array, say:

names = {"Marjory Stewart-Baxter", "Hubert Cumberdale", "Barbara Logan-Price", "Jeremy Fisher", "Mable"}

(more…)