You are currently browsing the category archive for the 'code' category.

Since Ruby is my new favorite toy, I thought it would be fun to try my hand at C extensions.  I came across David Blei’s C code for Latent Dirichlet Allocation and it looked simple enough to convert into a Ruby module.  Ruby makes it very easy to wrap some C functions (which is good to know if you need a really fast implementation of something that gets called alot).  Wrapping a C library is slightly harder, but not horribly so.  Probably most of my challenge was the fact that it’s been so long since I wrote anything in C.

Since the code is open source, I decided to release the Ruby wrapper as a gem on GitHub.  I chose GitHub over RubyForge, because it uses Git and freakin’ rocks.  But GitHub is a story for another day.  Feel free to contribute and extend the project if you’re so inclined.

A basic usage example:


require 'lda'
# create an Lda object for training
lda = Lda::Lda.new
corpus = Lda::Corpus.new("data/data_file.dat")
lda.corpus = corpus
# run EM algorithm using random starting points
lda.em("random")
lda.load_vocabulary("data/vocab.txt")
# print the topic 20 words per topic
lda.print_topics(20)

You can also download the gem from GitHub directly:

gem sources -a http://gems.github.com
sudo gem install ealdent-lda-ruby

You only need the first line if you haven’t added GitHub to your sources before.

A couple of days ago, I wrote a script that would tweet anything you plurked. Thanks to some code from Neville Newey (based on PHP code by Charl van Niekerk), the plurk.py script I wrote has been updated to both plurk your tweets and tweet your plurks. This should work on both windows and linux machines. If you have access to a linux machine, I suggest setting up a cron job to take care of this. As I mentioned in the previous post, if you set up a cron job, be sure to change the path to plurkdb.dat to an absolute path. I have done the most testing on this with python 2.4 in linux.

This code is open source under the Creative Commons 3.0 Attribution license that this blog uses Creative Commons BSD license. Neville’s code appears to be under CC:Attribution 2.5 for South Africa, by what I could glean from his site. I have considered making this an open source project under Google code but have yet to take it all the way. Google sets a lifetime limit of 10 projects, so I will continue to hoard those against future need. If you make modifications to the code, please let me know and I will probably post them here and in the code for future releases, so we all win.

Note that the command line parameters have changed:

plurk.py <twitter username> <twitter password> <plurk username> <plurk password>

And of course, as with all software, use at your own risk.

If you want to use Plurk, but aren’t ready to leave Twitter, I wrote a little python script you can use to automatically mirror your plurks on Twitter. This will not work for response plurks, but your main plurks will be extracted and posted to your Twitter account with the prefix “plurking:” followed by your plurk.

The resulting tweet looks like this:

sample of what the script outputs in twitter

Download the script and set it up as a cron job (or you could execute it manually). It should work with python 2.4 and later. It stores a plurkdb.dat file (which you should probably assign an absolute path to, depending on the behavior of cron on your system). This file is checked every time it is run to make sure that duplicate plurks aren’t being tweeted. You should pass the following parameters on the command line (or modify the script so they are hardcoded, if you want): <twitter username> <twitter password> <plurk username> <plurk password>. Update: see later post on updated plurk script.  And like with all software, use at your own risk.

Please let me know if you have any problems with it or see room for improvement. I hacked this out in a hurry, so …

I discovered the java.util.Properties class a couple weeks ago in the ginormous Java API docs. If you’ve ever created a software project where you have a lot of different settings that change frequently, this is the class for you. In my research, I implement all these different algorithms for various things, find out they don’t work, implement something else, rinse, repeat. Being able to look back at my results from two months ago and then loading the exact same configuration and running the experiment all over again is a must. Enter the Properties class.

Read the rest of this entry »

I mentioned the esoteric programming language brainfuck a little while back. It consists of 8 operations and was created in order to make the smallest compiler in the world (I think the current best is 174 bytes). I was reading a post over on Good Math, Bad Math that defines arithmetic in terms of sets. Pretty basic if you’ve done anything with set theory, but Mark has a clear way of explaining things so I usually try to read all of his posts. I’ve been playing catch-up today.  It struck me immediately how closely the set form that Mark describes matches the syntax/logical structure of brainfuck.  So I decided to play around a little.  Read on for more.

Read the rest of this entry »

Either the coolest or the stupidest programming language in the world, brainfuck was designed by Urban Müller in order to create the world’s smallest compiler of a Turing-complete programming language. Originally his compiler was 240 bytes in size, but he reportedly got it down to about 200 bytes. Others have gotten it below that. The language consists of only 8 operations, which I will go into after the jump.

Read the rest of this entry »

Ever get pissed twice before you’ve really even opened your eyes? This is why I shouldn’t read my RSS feeds so early in the morning. At the top of the list is Bush equating Democrats who oppose the war (as if it could be called opposition, anyway) to those who ignored Hitler and Lenin and then Hillary firing back. Am I mad at Bush for making this analogy? No and I think he’s correct, but not in the way he thinks. I’m more angry at Hillary for firing back and not recognizing her own culpability. The Sheepocrats sat back and did nothing four years ago when this war began and passed the Patriot Act before that. They have endorsed the war at every stage since and even their current so-called opposition is luke-warm and putrid with its weasliness. So yeah, they are like people who ignored the rise of Hitler and Lenin. If she had recognized that and said it publicly, it would have done her credit.

 

Next up, I was reading a few bit twiddling hacks and came across a nice one for branchless absolute value [hat tip]. The hacks are all in the public domain, too, so that’s good. He does list the occasional variation that is patented, an enormously helpful fact if you’re producing commercial software. So here is the patented version of the branchless absolute value:


int v; // we want to find the absolute value of v
int r; // the result goes here
int const mask = v >> sizeof(int) * CHAR_BIT - 1;
r = (v ^ mask) - mask;

The last ^ (XOR) - (subtract) combination represents the patent. What works also?

r = (v + mask) ^ mask;

As Sean points out, though, the patent probably could be contested if the holder (none other than Sun Microsystems) ever tried to enforce it. So what ticked me off is that such a thing could be patented. I raise my hands in impotent fury at the ludicrousness of software patents. I don’t blame the inventors for them, it’s something you pretty much have to do these days. I blame the system that makes that true.

Update

Did some benchmarks on the two versions of absolute value given above.  Using a 3.06GHz processor, I could run 4 billion absolute values in 18.916 +/- 0.021 seconds for the patented version and 18.906 +/- 0.026 seconds for the free version.  So no need to even bother with the patented version it looks like.

So I’ve been reading A New Kind of Science by Stephen Wolfram, the creator of Mathematica. It was hyped up big time back when he first wrote it, since he had gone silent for a number of years, hinting that he was about to do something big. So my middle little sister got me the book for Christmas (cuz she rocks) and I cracked it open a few times. It’s about 846 pages of text (yipes!) and then another 351 pages of notes. Quite daunting. So I put it down and have meant to pick it back up a thousand times. Today I was needing a diversion because a particular C++ issue was giving me fits.

 

In Chapter 2, Wolfram introduces a fairly simple 2-dimensional cellular automata (one spatial dimension, one temporal dimension). The temporal dimension can be plotted as another spatial dimension producing a nice little spreadsheet style graph. Each cell of the graph can be considered a bit. Depending on whether the bit is set, the cell is either shaded or not. So the single line in the spatial dimension contains some initial setting. Let’s say there is one single bit set in the middle of the line, so it might look like this:

000000000010000000000

Read the rest of this entry »

For one of my homework assignments, I have to solve words encrypted via a substitution cipher. These ciphers were insecure before computers came around, but they are still fun. If you’re unfamiliar with them, you’d probably recognize them as the cryptograms (”Cryptoquote”) in your local newspaper. In the simplest form, each letter is mapped to a different letter of the alphabet. A lot of people do these for fun and I know at least one person reading this does. The result is a run of text that might look like:

ov umy rfgs f nmg
MY DOG EATS A LOT

There are many ways of going about solving substitution ciphers, but a common way is by counting frequencies of characters. As most people know, e tends to occur more than other letters in most written English. The rest of the letters typically follow a pattern, as well, but that pattern degenerates once you leave the most common letters. The domain of the text you are examining is fundamentally important here. By domain I mean whether this text is from a newspaper, an IM, transcribed speech, etc. You can also look at bigrams, two character sequences, to find the most commonly appearing sequences. In English, th appears much more ty, but ty still occurs. When trying to solve substitution ciphers this way, you are essentially matching the frequency distribution of the cipher text to the distribution of English and building a mapping from there. To put that a different way, you are matching up the most common letters or sequences in the garbled text with the most common real English letters or sequences.

Once the frequency counts have revealed the most common letters, many people proceed to deduction to eliminate the rest. Of course, this requires knowledge of English words directly, which has an impact on computational approaches to solving substitution ciphers automatically. I’m curious what approaches people have taken (if any) other than using a dictionary of English words and trying to find matches from there.

Read the rest of this entry »

In a recent press release, kannuu is claiming to have revolutionized text entry. They claim that you can now perform text entry with just your thumb at the same speed of a regular keyboard. Too good to be true? Here is their method, complete with Hype™.

“Advancing text entry exponentially, kannuu’s powerful and precise Partial Word Completion® technology enables users with a fail-safe text entry solution. The kannuu application appears on device, as a four-point diamond shape, comprised of the most popular letters in the database it is indexing, with the center kannuu logo leading to the next set of choices.”

They registered a trademark on the phrase “partial word completion”?? Blerg. Not only do they have an über lame web 2.0 name in lowercase, they gotta stop people from marketing a similar technology under their oh-so-not-original name. Why does this make me so angry? Anyhow, I’m running off sideways on a rant that is pretty insignificant.

The real point here is the potential for coolness. So here is the technology: you enter a letter, it presents you with a “diamond” shape and the most common letters or group of letters that follow the letter(s) you just entered. In this way, most of your everyday phrases will be right up at the top of the list of things you’re presented so you could potentially be entering words with fewer keystrokes and all with very little thumb movement. This could really revolutionize key input and maybe bring pocket computers to reality [source].

So here is what I think the technology is based on. A very common technique in language technologies is the use of n-grams. So they use a character-based n-gram model to predict the most common letter or letters that you would type next based on some corpus. This isn’t anything new. Cell phones already have a T9 input method that guesses the most common word based on the single letters you choose. This isn’t all that different. If they have done the interface well, that could be a serious improvement.

If you’re interested in character-based n-gram models, I go into them in more depth after the jump.

Read the rest of this entry »

Suppose you have an array of floating point numbers with each index into the array being an id number corresponding to some external data structure. You want to sort this array, but in doing so you would destroy the references to the id numbers, since the indexes of the array would no longer correspond to the correct id numbers in the external data structure. For example, let’s say we are dealing with customers of a store and each value in the array is their current balance.

balances = {25.61, 13.45, 89.75, 21.2, 96.50}

Each index in the balances array corresponds to an index in some other array, say:

names = {"Marjory Stewart-Baxter", "Hubert Cumberdale", "Barbara Logan-Price", "Jeremy Fisher", "Mable"}

Read the rest of this entry »

 Follow me on Twitter
 RSS Feed

About Me

Jason M. Adams

My name is Jason Adams and I work on opinion mining for a growing startup in Atlanta, GA.

Calendar

December 2008
S M T W T F S
« Nov    
 123456
78910111213
14151617181920
21222324252627
28293031  

Archives

Site Statistics

  • 105,404 reads

Site Information

Contact me: jaso...@gmail.com

Creative Commons License

This work by Jason M. Adams is licensed under a Creative Commons Attribution 3.0 License.

Header image credit seakwenby.

Random Crap