Posts Tagged ‘social graph’

Daniel Tunkelang pointed out a new social search service today dubbed Aardvark.  Social search works by delegating your queries to your social network to elicit a response from them.  Aardvark adds in some algorithmic juice to only ask those in your extended social network who might be able to answer your question.

At first blush, this seems better than just dropping your question on Twitter.  Questions posted to Twitter have the benefit of reaching a large audience quickly, but the disadvantage of being blasted to a bunch of people who probably don’t know or don’t care.  You’re also likely only to reach people in your immediate social graph (depth = 1) and only those who happen to be checking Twitter at the time of your question.  You need a pretty big group of followers on Twitter for that to be a consistently effective method.

Of course, the question remains whether Aardvark can deliver the goods.  It sounds good that you can throw your question out to your social network up to some arbitrary depth.  You’re not spamming everyone with the question either, just the people who have given some signal that they might be able to answer it.  But Aardvark wanted me to invite a bunch of Facebook friends, something I very rarely do.  How effective will it be given that I passed on that option?  As far as I can tell, I’m only connected to Daniel since he’s the one who gave me the invite.

If you’re curious about Aardvark, I have ten invitations available.  Leave me a comment, email me, or send me a tweet and it’s yours.

The Road Ahead for TunkRank

Posted: 6 March 2009 in Uncategorized

Now that TunkRank has gone live, I am faced with some interesting choices.

First of all, I want to make the code open source. The only barrier to my doing that is that I have passwords saved in version control that really can’t be shared with the outside world. I didn’t even think about that being an issue until I was actually deploying it and saw someone mention it in a blog tutorial. There are ways of getting around this problem, and I’ll have to look into them before I can do it.
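One common way around the password problem is to keep credentials out of the repository and read them from the environment at startup. The following is a minimal sketch of that idea, not what TunkRank actually does; the variable names (TUNKRANK_DB_*) and the default database name are hypothetical.

```ruby
# Build database settings from environment variables instead of values
# committed to version control. The TUNKRANK_DB_* names are hypothetical.
def database_config(env = ENV)
  {
    adapter:  "mysql",
    host:     env.fetch("TUNKRANK_DB_HOST", "localhost"),
    database: env.fetch("TUNKRANK_DB_NAME", "tunkrank"),
    # fetch without a default raises KeyError, so a missing secret fails loudly.
    username: env.fetch("TUNKRANK_DB_USER"),
    password: env.fetch("TUNKRANK_DB_PASSWORD")
  }
end
```

A hash like this can be handed to ActiveRecord::Base.establish_connection. The secrets themselves then live only on the server, and the code can be published as-is.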

Next is the issue of expanding the size of the explored social graph. Right now I have found about 2 million users and know the followers of 500 thousand of those. When I was doing everything in memory, it was very fast for me to expand the graph and relatively easy to merge new pieces into it. Now I am using a database (MySQL), and operations on the social graph are just not fast. So I need something better.

Also, I want to make the TunkRank scores available as a data set or at least via an API, so I need to look into ways of doing that. Merb makes it pretty easy to deliver results as XML or JSON; I just have to get around to doing it. Right now you can find user jimbob’s TunkRank score either by entering “jimbob” in the search box on the main page or by going to the URL http://tunkrank.com/score/jimbob. Extracting it as JSON or XML will be just a matter of going to http://tunkrank.com/score/jimbob.format, where format is json or xml.
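To make the URL scheme concrete, here is a small sketch of a helper that builds those addresses. The helper name score_url and the constant are hypothetical, and since the XML/JSON output is not built yet, nothing here assumes what the response will look like.

```ruby
require "uri"

# Base of the score URLs described above.
TUNKRANK_SCORE_BASE = "http://tunkrank.com/score"

# Build the score URL for a user; pass :json or :xml for the planned formats.
def score_url(username, format = nil)
  path = "#{TUNKRANK_SCORE_BASE}/#{URI.encode_www_form_component(username)}"
  format ? "#{path}.#{format}" : path
end
```

Once the formats exist, something like Net::HTTP.get(URI(score_url("jimbob", :json))) should be enough to pull a score from a script.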

I need to provide scoring systems other than TunkRank, so that TunkRank can be compared against them. I’m not sure whether this is better served by just providing the data set and letting people play around in their own database, or whether I should provide alternate views. The former is more versatile; the latter will probably reach a larger audience.

Currently I show Google ads on TunkRank, mainly because I have spent a small amount of money on it and wouldn’t mind getting that back. If it starts making any real kind of money, that probably means the traffic has increased significantly and I will need to look at hosting it on EC2 or somewhere. I have no illusions that TunkRank will make me rich. I expect it will make me literally tens of dollars … poorer. :)

Finally, there is the issue of the score I display. I chose to show the percentile ranking because it’s easy to see where you are in comparison to other twitter users. If I just showed you the raw TunkRank score, you would have no frame of reference. My solution was to show you both. The downside is it groups all the “interesting” users into the top three percentiles. @NealRichter has put some thought into this, and I urge you to check out his post, leave some comments and help come up with a scoring mechanism that offers better granularity and yet lets you easily compare yourself to the rest of the world. Thus completes today’s desperate plea for crowdsourcing.
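For reference, a percentile ranking can be computed roughly like this. It is a sketch under one common definition (the share of users scoring at or below you), which may not match exactly what the site does, and the sample scores are made up for illustration.

```ruby
# Percentile rank: the share of users whose raw score is at or below `score`,
# rounded to a whole number. One common definition, assumed here.
def percentile(raw_scores, score)
  at_or_below = raw_scores.count { |s| s <= score }
  (100.0 * at_or_below / raw_scores.size).round
end

# Made-up raw scores, heavily skewed the way follower counts tend to be.
sample = [0.5, 1, 2, 3, 5, 8, 40, 900, 12_000, 9_000_000]
top    = percentile(sample, 9_000_000)
second = percentile(sample, 12_000)
```

With a skewed distribution like this, the two highest users land only one rank step apart even though their raw scores differ by orders of magnitude, which is exactly the granularity problem.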

A couple months ago, Daniel Tunkelang posted an algorithm on his blog that attempts to emulate PageRank for Twitter. I implemented a toy version I dubbed TunkRank, and then suggested that name on his blog. It got some traction, so I figured what the heck and decided to implement it on TunkRank.com.

Now, there appeared to be a little debate about just whether it is actually emulating PageRank or something else on Daniel’s blog, but I leave it to you to read the comments  on his post if you’re interested. There are also plenty of ideas there on the best way to establish a measure of influence.  I’ll limit the discussion in this post to the basics.

  1. The amount of attention you can give is spread out among all those you follow. The more you follow, the less attention you can give each one.
  2. Your influence depends on the amount of attention your followers can give you.

As a twitterer, your influence does not depend on how many people you follow. However, your usefulness as a follower does. Having higher influence depends on having many followers who follow relatively few people but are followed by many. Followers like that are more likely to pick up on your tweets, act on them, retweet them, whatever. You gain influence through the social graph thanks to their influence.

Therefore, your TunkRank score is a reflection of how much attention your followers can give you, both directly and indirectly through their own followers.
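Here is a minimal sketch of that idea in plain Ruby. It is not the ActiveRecord or SQL version from the gists. It assumes the recurrence from Daniel’s post, Influence(X) = sum over each follower Y of (1 + p * Influence(Y)) / (number of people Y follows), where p is the probability that a follower passes a tweet along. The default p of 0.05, the fixed iteration count, and the method name are my assumptions for illustration.

```ruby
# followers: Hash mapping each user to an Array of the users who follow them.
# p: assumed constant probability that a follower retweets (0.05 is a guess).
def tunkrank(followers, p: 0.05, iterations: 50)
  # Work out how many accounts each user follows by inverting the follower lists.
  following_count = Hash.new(0)
  followers.each_value { |fs| fs.each { |f| following_count[f] += 1 } }

  users  = (followers.keys + following_count.keys).uniq
  scores = {}

  # Repeatedly recompute every score from the previous round's scores.
  iterations.times do
    updated = {}
    users.each do |user|
      updated[user] = followers.fetch(user, []).sum(0.0) do |f|
        # Each follower splits their attention across everyone they follow.
        (1.0 + p * scores.fetch(f, 0.0)) / following_count[f]
      end
    end
    scores = updated
  end
  scores
end
```

A user with no followers ends up with a score of 0, and a follower who follows only you contributes more than one who follows thousands of people, which matches the two points above.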

I implemented this algorithm in Ruby using Merb, MySQL, Capistrano, nginx, and ActiveRecord (and, of course, Git for version control). While my job involves working on a web app, my role has mostly been on back-end NLP stuff. I’m still quite new to the whole Rails-level-web-app-world. For those who don’t know, Merb is a framework similar to Ruby on Rails. They are so similar that the two are merging and will become Rails 3. ActiveRecord is the object-relational mapper (ORM) that Rails uses. The standard ORM for Merb is DataMapper, but I stuck with something I’m more familiar with to limit the variables in my little project.

There are many aspects of getting a web app up and running that I had only heard about in passing — and many more I’m still lost on. But I figured implementing TunkRank would be an interesting place to start.

Phase I – Data Collection

As I said, I implemented TunkRank as a toy the same night that Daniel posted his algorithm. Things seemed to work out quite nicely and I liked it on theoretical grounds as a measure. When I decided to implement the real version, the task of hammering Twitter millions of times suddenly loomed. I suppose I thought there were maybe about 1 million active accounts on Twitter. I have harvested over 2 million before slowing my harvesting down in favor of other development. I have also collected about 40 million edges in the social graph (user A follows user B is one edge). Of the 2 million users I have encountered, those 40 million edges are for only 25% of them. I still haven’t gotten the followers for the remaining 1.5 million. When I do so, I’m sure I’ll discover another million or three users I haven’t seen yet.
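To make the bookkeeping concrete, here is a small sketch of how a partially explored graph can be summarized. The in-memory layout (a Hash from each explored user to a Set of their followers) and the method name are assumptions for illustration, not necessarily how my harvester stored things.

```ruby
require "set"

# followers_of maps a user to the Set of users who follow them.
# A key exists only for users whose follower list has already been fetched.
def graph_stats(followers_of)
  # Everyone we have seen: explored users plus anyone appearing as a follower.
  seen = Set.new(followers_of.keys)
  followers_of.each_value { |fs| seen.merge(fs) }
  {
    users:      seen.size,
    edges:      followers_of.each_value.sum(&:size),
    # Users we know exist but whose followers we have not fetched yet.
    unexplored: (seen - followers_of.keys).to_a.sort
  }
end
```

Every user in the unexplored list is another request to Twitter, and each of those requests can turn up users who have not been seen at all, which is why the frontier keeps growing.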

I stopped where I did because I was using Ruby’s marshal functionality to dump the social graph to disk. Each dump was weighing in around 250 MB and it was exceeding Marshal’s ability to function. At this point I threw everything into a MySQL database. Bleh! I can’t even describe the pain in the ass that was. If I were to do that again, I would certainly use PostgreSQL, and may still do so. Better yet, I would use some sort of column store database.  But it’s in the MySQL db now and running ok (just ok, not great or even well). MySQL dies quietly and annoyingly at times.  I hate it.

Doing in ActiveRecord the operations I used to do in memory is mind-bogglingly slow by comparison, as you’d expect. Twitter just released the ability to pull all follower ids in one request, which would have made my life easier had it come sooner, but I can still benefit from it going forward. Also, I should have been storing more information about users than just the Twitter username. Having to go back and collect that was slow and annoying, but it’s done.

Phase II – Implementing the Algorithm

The algorithm is simple to compute. Check out this gist for a version that calculates it using ActiveRecord. I’d post it here, but WordPress.com sucks and I’m stuck with it. The code uses ActiveRecord more than I’d like, so I rewrote it in SQL using twitter ids.  The gist for that is here.  The #{p} and #{self.twitter_id} are Ruby variables.

Phase III – Doing the Web App

The web app itself is both the most important step and the least fun for me. I very much enjoyed putting together the code to collect the Twitter social graph and then computing the TunkRank scores, but all the nuts and bolts of getting a web app up and running are tedious. Some of it is interesting. Merb isn’t so bad, though I feel like the documentation is shitty. There is an open source Merb book that is missing stuff in all the sections I needed the most. The API documentation isn’t bad, but isn’t easy to search for high level things that you would normally find in a tutorial. Nor should it be — it’s API documentation not a tutorial.

Fortunately, most things were easy enough that I could find a solution eventually. The whole deploying step is foreign to me, and I’m an Apache noob, so when it comes to balancing Mongrel instances I’m like wtf? Fortunately, I found a few tutorials I was able to piece together.

So the final product is hosted on my 1.8 GHz dual core Dell laptop with 2 GB RAM running Ubuntu 8.10. If you check it out, hopefully it won’t overtax my pathetic server and bring the site down. My data is becoming a little stale so if your username isn’t found, please be patient. When a new person is encountered, I queue them for processing.

Final Thoughts

You can also follow @tunkrank on Twitter. I originally had that account acting as a bot that tweets scores when it encounters influential users. Also,  I was having it auto-follow anyone it grades, but upon reflection, it occurred to me these two things were just plain spammy. I chalk it up to a bad decision in the dead of night. Instead I will just have it follow anyone who follows it.  See my twitter philosophy for how the account will be managed.  I will post updates there on changes, fixes, and up/downtime.

The TunkRank score itself can grow quite large, especially for users with a high number of followers. I present percentiles as the measure, so everything falls in the interval [0,100]. That does not properly reflect that someone in the 100th percentile can be almost 1000 times more influential than someone in the 99th. I’m open to suggestions about how better to show this information. Neal Richter had a few good ideas, and perhaps I’ll try one of those. Still, I’m left feeling a little dissatisfied by all of the scoring mechanisms (my own included). As Neal pointed out, his ideas are starting points, and I’d like to hear what other people would like to see before proceeding with a different scoring method.
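Just to illustrate the kind of alternative I mean, here is a sketch of a logarithmic scale that still lands in [0,100]. To be clear, this is not one of Neal’s proposals (those are in his post) and not something the site does; the method name and the choice of log(1 + score) are assumptions.

```ruby
# Map a raw score onto 0..100 by its logarithm relative to the highest raw score.
# Large gaps in raw score stay visible instead of collapsing into one percentile.
def log_scaled(score, max_score)
  return 0.0 if score <= 0
  100.0 * Math.log(1 + score) / Math.log(1 + max_score)
end
```

On this scale a user with a raw score 1000 times smaller than the top user sits well below them rather than one notch away, though it loses the easy "you beat N% of users" reading that percentiles give.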

Let me know what you think.