I’m always a little annoyed I have to implement sorting Map keys by their values myself in Java. It seems like they should be a part of the standard Collections library or something. Maybe they are and I just haven’t seen it? My solution (gist) is based on feedback from Josh in the comments to a previous post. How does that look to you?
Posts Tagged ‘algorithms’
Java maps and sorting
Posted: 1 August 2009 in UncategorizedTags: algorithms, code, collections, java, programming, sorting
Top 25 Memes
Posted: 28 February 2009 in UncategorizedTags: algorithms, binary comparisons, memes, ranking algorithms, runtime analysis, simulations, top 25 lists
No this isn’t a list of the top 25 memes of all time. It’s a post inspired by the “Top 25″ memes that have been going around Facebook, MySpace, the intarwebs, etc. These memes are fairly simple, as many are. List your top 25 favorite albums in your collection. List your top 25 books. List your top 25 experiences with acne. The lists go on. But, how would you go about producing such a list? Do you just have the top 25 albums in your collection ranked and ready in your mind? Or do you have a crapload of albums and figuring out the top 25 would be a non-trivial task?
A friend approached me with this very problem and he brought up using binary comparisons. I have talked about binary comparisons when dealing with machine translation evaluation before, and many in the academic community have begun to appreciate their value for eliciting human preferences. Performing evaluations using binary comparisons is straightforward: give the evaluator pairs of items and ask which is better according to some subjective metric. In the case of machine translation evaluation, the task is to decide which translation is more human-like.
It ought to be obvious that a major problem with binary comparisons in a data set of any appreciable size is the massive number of comparisons you will need in order to rank every item in the set. For the sake of argument, let’s assume that your comparisons never contradict themselves. Under our assumption, the following does not occur: A is rated better than B, B is rated better than C, and C is rated better than A. This lets us bound the amount of comparisons to O(n log n), since we are basically dealing with a sorting algorithm using the human as the comparison operator. This was my friend’s idea and I really liked it.
But let’s say you have 1000 albums. 1000 * log (1000) = 9966 comparisons! That’s just too much work. Maybe we ought to just rate them on a scale of 1-100 (or whatever) and order them using that. It’s not perfect, but it’s ten times less work. But what if you were dedicated and actually wanted a good ranking?
Binary comparisons lend themselves to binary trees and a binary search tree would be ideal, so heapsort seems like a natural sorting algorithm. But we’re still stuck with O(n log n) to build it. Returning to the idea of “top 25″ memes, we don’t need the ordering of the entire set, just the top 25 items, so we can surely get rid of some of the work.
So rather than doing a full heapsort algorithm, I figured why not use an AVL tree. We don’t need to keep the entire tree balanced — we only need to worry about the right branch off the root which can be limited to 25 nodes. The algorithm is fairly simple (pseudocode):
for each item:compare item to root node:if item < root thendiscard
elseadd to right branch
balance the right branch
if the number in the right branch exceeds 25, replace the root node with the lowest value and delete that node.
This greatly reduces the number of comparisons needed since you very quickly bubble the best items into the top 25 branch. When a new item is being added to the top 25, only 4-5 comparisons need be made to determine its place. By continuously balancing the right branch, we can keep the height of that side as low as possible on average in order to prevent too many comparisons when new items are added to it.
I couldn’t figure out an easy way to compute the big O for this, so I ran some simulations instead. For a set of 1000 items, on average the user will have to perform 1528 +/- 2 comparisons to produce the top 25 items.
The code for the tree can be found in this gist. The nasty $counter global is just there to help me count the number of comparisons that are being performed for the purposes of the simulation. Some of the AVL Tree code was inspired by Bruno Preiss’s book on algorithms and data structures.
Fun with trees in Ruby
Posted: 20 November 2008 in UncategorizedTags: algorithms, github, inheritance, interfaces, java, programming, ruby, ruby gems, trees
Like Java and unlike Python, Ruby does not support multiple inheritance. Also there is no explicit way to create an interface. One way Ruby lets you get around both problems is by allowing you to include a module in a class. It’s not quite the same, but with the proper planning you can duplicate the functionality. Of course, one question you should always ask yourself when trying to shoehorn something from one language into another is if you’re really going about it the right way.
One way of implementing a Java-like interface in Ruby is by creating a module containing the skeleton functions you want the implementing class to implement.
module A
def method1() raise "not supported"; end
end
class B
include A
def method1
puts "now implemented"
end
end
Presto, module A is basically a Java interface. Of course, whether a method has been implemented is not checked until runtime when the method is actually called. Also if you mix in implemented methods alongside the interface methods, you have something very like an abstract class (minus the compile-time checking).
This came up when I was implementing a bunch of simple tree functions like finding the siblings of a node, finding the grandparent, the descendants, the leaves of a subtree, etc. It seemed like these were things that should be implemented already. And why not? So I threw all of those methods into a module and made it like a Java abstract class. All you have to implement is a method to call the parent of the current node (or return nil if there is none) and a method to get an Array of the children of the current node. Your class can pull children from a database, a file, something more complex — it doesn’t matter. Just implement those two methods and drop in the SimpleTree module and problem solved.
Since I’ve been having fun with gems, I made one for this and slapped it up on GitHub. To get it, just type:
sudo gem install ealdent-simple-tree
Assuming that you have already done this as some point in the past:
gem sources -a http://gems.github.com
Feel free to extend it, modify it, contribute to it, etc. I’m using the BSD license, which is my current favorite.
Cognate Identification: Orthographic Methods
Posted: 26 January 2008 in UncategorizedTags: algorithms, cognate identification, cognates, computational linguistics, historical linguistics, language change, linguistics, machine translation, natural language processing, orthography, string matching
In previous posts on cognate identification, I discussed the difference between strict and loose cognates. Loose cognates are words in two languages that have the same or similar written forms. I also described how approaches to cognate identification tend to differ based on whether the data being used is plain text or phonetic transcriptions. The type of data informs the methods. With plain text data, it is difficult to extract phonological information about the language so approaches in the past have largely been about string matching. I will discuss some of the approaches that have been taken below the jump. In my next posting, when I get around to it, I will begin looking at some of the phonetic methods that have been applied to the task. (more…)
Morning Madness
Posted: 2 November 2007 in UncategorizedTags: algorithms, bush, c, code, complaints, democrats, endless war, evil politicians, hillary clinton, patents, W
Ever get pissed twice before you’ve really even opened your eyes? This is why I shouldn’t read my RSS feeds so early in the morning. At the top of the list is Bush equating Democrats who oppose the war (as if it could be called opposition, anyway) to those who ignored Hitler and Lenin and then Hillary firing back. Am I mad at Bush for making this analogy? No and I think he’s correct, but not in the way he thinks. I’m more angry at Hillary for firing back and not recognizing her own culpability. The Sheepocrats sat back and did nothing four years ago when this war began and passed the Patriot Act before that. They have endorsed the war at every stage since and even their current so-called opposition is luke-warm and putrid with its weasliness. So yeah, they are like people who ignored the rise of Hitler and Lenin. If she had recognized that and said it publicly, it would have done her credit.
Next up, I was reading a few bit twiddling hacks and came across a nice one for branchless absolute value [hat tip]. The hacks are all in the public domain, too, so that’s good. He does list the occasional variation that is patented, an enormously helpful fact if you’re producing commercial software. So here is the patented version of the branchless absolute value:
int v; // we want to find the absolute value of v
int r; // the result goes here
int const mask = v >> sizeof(int) * CHAR_BIT - 1;
r = (v ^ mask) - mask;
The last ^ (XOR) – (subtract) combination represents the patent. What works also?
r = (v + mask) ^ mask;
As Sean points out, though, the patent probably could be contested if the holder (none other than Sun Microsystems) ever tried to enforce it. So what ticked me off is that such a thing could be patented. I raise my hands in impotent fury at the ludicrousness of software patents. I don’t blame the inventors for them, it’s something you pretty much have to do these days. I blame the system that makes that true.
Update
Did some benchmarks on the two versions of absolute value given above. Using a 3.06GHz processor, I could run 4 billion absolute values in 18.916 +/- 0.021 seconds for the patented version and 18.906 +/- 0.026 seconds for the free version. So no need to even bother with the patented version it looks like.
Merge sort fun
Posted: 18 September 2007 in UncategorizedTags: algorithms, code, java, memory, merge sort, programming, sort
Suppose you have an array of floating point numbers with each index into the array being an id number corresponding to some external data structure. You want to sort this array, but in doing so you would destroy the references to the id numbers, since the indexes of the array would no longer correspond to the correct id numbers in the external data structure. For example, let’s say we are dealing with customers of a store and each value in the array is their current balance.
balances = {25.61, 13.45, 89.75, 21.2, 96.50}
Each index in the balances array corresponds to an index in some other array, say:
names = {"Marjory Stewart-Baxter", "Hubert Cumberdale", "Barbara Logan-Price", "Jeremy Fisher", "Mable"}



A couple months ago,