Top 25 Memes

Posted: 28 February 2009 by Jason Adams in Uncategorized
Tags: , , , , , ,

No this isn’t a list of the top 25 memes of all time. It’s a post inspired by the “Top 25″ memes that have been going around Facebook, MySpace, the intarwebs, etc. These memes are fairly simple, as many are. List your top 25 favorite albums in your collection. List your top 25 books. List your top 25 experiences with acne. The lists go on. But, how would you go about producing such a list? Do you just have the top 25 albums in your collection ranked and ready in your mind? Or do you have a crapload of albums and figuring out the top 25 would be a non-trivial task?

A friend approached me with this very problem and he brought up using binary comparisons. I have talked about binary comparisons when dealing with machine translation evaluation before, and many in the academic community have begun to appreciate their value for eliciting human preferences. Performing evaluations using binary comparisons is straightforward: give the evaluator pairs of items and ask which is better according to some subjective metric. In the case of machine translation evaluation, the task is to decide which translation is more human-like.

It ought to be obvious that a major problem with binary comparisons in a data set of any appreciable size is the massive number of comparisons you will need in order to rank every item in the set. For the sake of argument, let’s assume that your comparisons never contradict themselves. Under our assumption, the following does not occur: A is rated better than B, B is rated better than C, and C is rated better than A. This lets us bound the amount of comparisons to O(n log n), since we are basically dealing with a sorting algorithm using the human as the comparison operator. This was my friend’s idea and I really liked it.

But let’s say you have 1000 albums. 1000 * log (1000) = 9966 comparisons! That’s just too much work. Maybe we ought to just rate them on a scale of 1-100 (or whatever) and order them using that. It’s not perfect, but it’s ten times less work. But what if you were dedicated and actually wanted a good ranking?

Binary comparisons lend themselves to binary trees and a binary search tree would be ideal, so heapsort seems like a natural sorting algorithm. But we’re still stuck with O(n log n) to build it. Returning to the idea of “top 25″ memes, we don’t need the ordering of the entire set, just the top 25 items, so we can surely get rid of some of the work.

So rather than doing a full heapsort algorithm, I figured why not use an AVL tree. We don’t need to keep the entire tree balanced — we only need to worry about the right branch off the root which can be limited to 25 nodes. The algorithm is fairly simple (pseudocode):

  • for each item:
  • compare item to root node:
  • if item < root then
    • discard
  • else
    • add to right branch
    • balance the right branch
    • if the number in the right branch exceeds 25, replace the root node with the lowest value and delete that node.

This greatly reduces the number of comparisons needed since you very quickly bubble the best items into the top 25 branch. When a new item is being added to the top 25, only 4-5 comparisons need be made to determine its place. By continuously balancing the right branch, we can keep the height of that side as low as possible on average in order to prevent too many comparisons when new items are added to it.

I couldn’t figure out an easy way to compute the big O for this, so I ran some simulations instead. For a set of 1000 items, on average the user will have to perform 1528 +/- 2 comparisons to produce the top 25 items.

The code for the tree can be found in this gist. The nasty $counter global is just there to help me count the number of comparisons that are being performed for the purposes of the simulation. Some of the AVL Tree code was inspired by Bruno Preiss’s book on algorithms and data structures.

Comments
  1. Jon Elsas says:

    Its O(N log K ) where K is the number of top items (25 in this case). you need to look at all N data points, but the insertion time is less (log(K)) since the size of the tree is at most K elements.

  2. Jason Adams says:

    Ah good point, but wouldn’t that just be worst case? In the best case, it’s in descending order already and you make K log K + N – K comparisons. I was thinking in the average case you’d have some probability that is decreasing as you progress through the list that the value would be propagated through the tree.

    So building the first 25 items has K log K comparisons. Then for each remaining item the runtime would be O(1) if root. Maybe P(item > root) reduces to some constant value and the runtime is still O(N log K)?

  3. Jon Elsas says:

    I think you’re making it too complicated. best case would be (N – K + K log K), as you pointed out. If we assume K is a constant, or much smaller than N, then both worst- and best-case are essentially the same, O(N). If K = O(N), then its basically a full sort algorithm.

  4. Jason Adams says:

    Ahh, ok, I was indeed making it too complicated. Thanks for clearing that up! :)

  5. Jason Adams says:

    And just noticed that wordpress mangled greater than and less than symbols in my first comment.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <pre> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>