You are currently browsing the category archive for the 'python' category.

I just spent the day with a couple of friends at the Google App Engine Hackathon in Atlanta.  We got to see Google Atlanta - or the public part of it anyway.  We weren’t permitted in the cafeteria or in the actual office area, which would have required signing non-disclosure agreements.  The office was about what I expected — the Google colors were in abundance, there were giant bouncing balls, and free drinks! (non-alcoholic)

We spent the day in a fairly hot conference room hacking away on a variety of projects.  We set up teams beforehand to work on projects that people proposed and I chose to work on a variation of a computing puzzles site, dubbed LangWar.  The idea is fairly simple in the early stages:  people submit programming puzzles and other people post their solutions in code form.  You can vote which questions you like and which answers you like (or dislike).  You can also leave comments on questions and answers.  The result of the ratings is that the best questions will be counted higher, in a method similar to Reddit, and the best answers will trickle to the top based on the votes of users.

This is very similar to Stack Overflow, but different in that it is intended to be more of a puzzle solving site that pits implementations in different programming languages against each other.  It’s sort of a battle royale of programming languages - thus the name, LangWar.  It’s more of an enhanced version of Project Euler, where people can vote on the questions and answers.

In any case, it was a great chance to get my hands dirty in Google App Engine, meet some Atlanta python coders, and have fun.  It’ll be interesting to see where LangWar goes from here, if it does go anywhere.

I’ve begun learning ruby for my new job, a language that doesn’t seem to have really gotten any traction in the NLP community (at least not that I’ve heard).  I had been using python for my NLP stuff (homework and projects) and Java for my recommender system stuff.  In retrospect, I could have used python for the recommender stuff, but I wasn’t aware of some speed-ups so resorted to Java.  Of course, the recommender stuff isn’t strictly NLP.  Ruby is just as well suited as python and seems a lot better than Java for many tasks (though Java certainly has its place).  At the very least, a scripting language like ruby or python is great for prototyping.  It’s easy to test new ideas quickly.

I was reading through Pang et al (2002), which deals with classifying movie reviews as positive or negative.  They look at three machine learning approaches:  Naive Bayes, Maximum Entropy classifier and Support Vector Machines.  This seemed like a good opportunity to try out my nascent ruby skills, since it’s the kind of crap I can roll together in python in short order (and do all the time).  So I downloaded the data for the paper (actually I downloaded the later data from the 2004 paper).  There are 1000 positive and 1000 negative movie reviews.  The task is to train a classifier to determine whether a review expresses a positive opinion (the author liked the movie) or a negative opinion (the author did not like the movie).  I chose to just use SVMs since they do best for this task according to the paper, they do really well for text categorization, and they are easy to use and download.

The results were quite nice.  Ruby turned out to be just as handy as python at manipulating text and setting up cross-validation:  the two main “challenges” in implementing this paper.  I used tf-idf for weighting the features and thresholded document frequency to discard words that didn’t appear in at least three reviews.  The result was that I achieved about 85.7% accuracy using the same cross-validation setup described in their follow-up work (Pang and Lee, 2004).  In other words, the classifier could correctly guess the opinion orientation of reviews as positive or negative nearly 86% of the time.
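For anyone who wants to see the feature weighting spelled out, here is a minimal sketch in python (not the ruby code I actually ran). It assumes whitespace tokenization and the plain "term count times log(N / document frequency)" form of tf-idf; the function name tfidf_vectors and the toy reviews are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=3):
    """Return (vocab, vectors): one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Discard terms that appear in fewer than min_df documents.
    vocab = {term for term, count in df.items() if count >= min_df}
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(t for t in tokens if t in vocab)
        # Assumed weighting: raw term count times log(N / df).
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vocab, vectors

# Toy data, only to show the threshold at work.
reviews = [
    "i loved this movie",
    "i hated this movie",
    "this movie was a mess but i loved it",
    "a terrible plot",
]
vocab, vectors = tfidf_vectors(reviews, min_df=3)
print(sorted(vocab))
```

The resulting sparse vectors are what would get handed to the SVM package; on the real data the threshold of three reviews trims the long tail of one-off words.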

Pang et al (2002) discussed some of their errors and hypothesized that discourse analysis might improve results, since reviewers often use sarcasm.  There’s also the case where authors use a “thwarted expectations” narrative.  This offered me one of the few chuckles I’ve ever had while reading a research paper:

“I hate the Spice Girls. … [3 things the author hates about them] …  Why I saw this movie is a really, really, really long story, but I did and one would think I’d despise every minute of it.  But… Okay, I’m really ashamed of it, but I enjoyed it.  I mean, I admit it’s a really awful movie …the ninth floor of hell… The plot is such a mess that it’s terrible.  But I loved it.”

References

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.  “Thumbs Up?  Sentiment Classification Using Machine Learning Techniques.”  In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, July 2002.

Bo Pang and Lillian Lee.  “A Sentimental Education:  Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts.”  In Proceedings of the ACL, 2004.

A couple of days ago, I wrote a script that would tweet anything you plurked. Thanks to some code from Neville Newey (based on PHP code by Charl van Niekerk), the plurk.py script I wrote has been updated to both plurk your tweets and tweet your plurks. This should work on both windows and linux machines. If you have access to a linux machine, I suggest setting up a cron job to take care of this. As I mentioned in the previous post, if you set up a cron job, be sure to change the path to plurkdb.dat to an absolute path. I have done the most testing on this with python 2.4 in linux.

This code is open source under the Creative Commons BSD license. Neville’s code appears to be under CC:Attribution 2.5 for South Africa, by what I could glean from his site. I have considered making this an open source project under Google code but have yet to take it all the way. Google sets a lifetime limit of 10 projects, so I will continue to hoard those against future need. If you make modifications to the code, please let me know and I will probably post them here and in the code for future releases, so we all win.

Note that the command line parameters have changed:

plurk.py <twitter username> <twitter password> <plurk username> <plurk password>

And of course, as with all software, use at your own risk.

If you want to use Plurk, but aren’t ready to leave Twitter, I wrote a little python script you can use to automatically mirror your plurks on Twitter. This will not work for response plurks, but your main plurks will be extracted and posted to your Twitter account with the prefix “plurking:” followed by your plurk.

The resulting tweet looks like this:

[Image: sample of what the script outputs in Twitter]

Download the script and set it up as a cron job (or you could execute it manually). It should work with python 2.4 and later. It stores a plurkdb.dat file (which you should probably assign an absolute path to, depending on the behavior of cron on your system). This file is checked every time it is run to make sure that duplicate plurks aren’t being tweeted. You should pass the following parameters on the command line (or modify the script so they are hardcoded, if you want): <twitter username> <twitter password> <plurk username> <plurk password>. Update: see later post on updated plurk script.  And like with all software, use at your own risk.
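The duplicate check is the only stateful part of the script, so here is a rough sketch of the idea. This is not the code in plurk.py: the one-ID-per-line file format and the names load_seen and filter_new are assumptions for illustration, and it is written for current python rather than 2.4.

```python
import os
import tempfile

def load_seen(path):
    """Read previously handled plurk IDs, one per line; empty set if no file yet."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def filter_new(plurks, path):
    """Return the (id, text) pairs not yet recorded at path, and record them."""
    seen = load_seen(path)
    fresh = [(pid, text) for pid, text in plurks if pid not in seen]
    with open(path, "a") as f:
        for pid, _ in fresh:
            f.write(pid + "\n")
    return fresh

# Demo with a throwaway file; under cron you would use a fixed absolute path.
demo_path = os.path.join(tempfile.mkdtemp(), "plurkdb.dat")
first_run = filter_new([("101", "hello"), ("102", "lunch")], demo_path)
second_run = filter_new([("102", "lunch"), ("103", "coffee")], demo_path)
print(["plurking: " + text for _, text in second_run])
```

The second run only returns the plurk that was not seen before, which is why the path to the file has to resolve to the same place every time cron runs the script.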

Please let me know if you have any problems with it or see room for improvement. I hacked this out in a hurry, so …

So I decided to finally fart around with OpenCalais a little. There’s a nice video on the site that gives you an impression of what it is capable of, but it’s also like all videos about software: propaganda. Calais is basically Named Entity Recognition (NER) software that can be accessed via a web API. Whereas a regular NER system might recognize named entities like people, organizations, and places, Calais also recognizes relationships like corporate acquisitions. To be a little more clear if you aren’t familiar with NER, it is basically the task of identifying the proper nouns in a body of text. Named entities aren’t always proper nouns, but that is one starting point. Examples would be: John Hancock (Person), New York (Place), and Apple (Organization). Calais recognizes relationships, which means we get an extra layer of information: Acquisition(Microsoft, Yahoo!).

Calais is put out by Reuters, which has a long history of helping out the NLP and IR research communities with data sets. Being Reuters, the data sets are all newswire stuff, and Calais is produced in that spirit. Currently the relationships and named entities available reflect that bias, but the list is expanding and it is probably flexible enough for most domains. Their claim is that with each new release, there will be additional entities and relationships available. Also, the software is completely free for commercial and private use. For this, I give Reuters props.

OpenCalais accepts requests over SOAP or HTTP POST, and you can take a look at their tutorials for exactly how to use it. After some very shallow digging on the googles, I found an open source project called python-calais, which is basically just a script that wraps some text and sends it to the Calais service, then processes the output. The output is in RDF (Resource Description Framework), delivered here as an XML document that is not very friendly to the human eye but is nice and powerful otherwise. The python-calais script uses an RDF library for python, so you’ll need to download that if you don’t already have it.

Running it on my most popular post, you get the following output:

93B6642D-0D7C-37Ab-A92F-66Ebfef13C8D :: Recommender Systems (Industryterm)
0Dccb106-442A-3848-Bd0B-A388E73F4C8C :: Chris Sternal-Johnson (Person)
Aab0D16A-Ad5A-348A-A8Dc-58Cf59A1Bc15 :: Kristina Tikhova (Person)
42F476A0-2Fae-3F36-808D-803E4F620Ab0 :: Java (Technology)
6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)
4003D863-C7A6-3E6F-8E3C-0913Bf2F8242 :: National Aeronautics And Space Administration (Organization)
77D1Ceb3-9900-3Dd7-8351-F29408B21412 :: Carnegie Mellon University (Organization)
Ee58Ef4B-1C98-3F8B-Aff8-3Fd6E3D76A9E :: Wonderful Site (Industryterm)
8F12E551-A8F1-3705-866C-D44D1A6A54F4 :: Richard M. Hogg (Person)
Adee23De-B1B0-37Ad-9E20-1Fa8094F6D39 :: Steel (Industryterm)
0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)
2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)
136157D8-D62E-3C55-Ae67-3Ec182C2C703 :: Phil Barthram (Person)
B6A8Dbfa-Fd35-32Bb-9E05-A2811C480000 :: Mike Tan (Person)
Ed8B5Fe4-616A-36Ea-8C47-3Eea7C71Aee0 :: Ben Eastaugh (Person)
D3Bcba58-00Fc-33C5-9346-Dbf6A2441867 :: Machine Learning (Technology)
F17C3779-3810-3Ff9-A42D-75C3137F0F7F :: Modern English (Person)
38116E8D-F8B4-3D03-B0Ad-C9A24B888E61 :: Jason M. Adams (Person)
4386B07C-F6B8-3991-Af74-Ab11A951F0Ee :: David Petar Novakovic (Person)
Aa14303F-F9F0-31B8-Adff-3B9C68E0A9F1 :: Language Technologies Institute (Organization)
Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)

As it picks up everything on the page, there is a lot included there that isn’t related to the post about Old English translation. Also, it picks up some weird so-called industry terms like “steel.” If you filter out just the text (manually), the output is a little more sensible:

6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)
Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)
0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)
2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)
136157D8-D62E-3C55-Ae67-3Ec182C2C703 :: Phil Barthram (Person)
F17C3779-3810-3Ff9-A42D-75C3137F0F7F :: Modern English (Person)

(The codes are unique identifiers.) Unfortunately, some important terms are still missed, like Old English. So it appears Calais has some growing to do, but it’s off to a good start. Part of the problem might be that the blog post is out of domain. I imagine with time, it will continue to improve. We’ll see.
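If you want to slice the listing above by entity type, a few lines of python will do it. This sketch works on the printed "identifier :: name (type)" lines exactly as shown in this post, not on the RDF itself or on python-calais internals; the function name group_entities is made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines of the form "<identifier> :: <name> (<type>)" as printed above.
LINE_RE = re.compile(r"^(?P<id>[0-9A-Fa-f-]+) :: (?P<name>.+) \((?P<type>\w+)\)$")

def group_entities(lines):
    """Group entity names by their type, skipping lines that do not parse."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            grouped[match.group("type")].append(match.group("name"))
    return dict(grouped)

sample = [
    "6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)",
    "Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)",
    "0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)",
    "2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)",
]
entities = group_entities(sample)
for entity_type, names in sorted(entities.items()):
    print(entity_type, names)
```

Grouping this way makes the odd results easier to spot, such as a language name filed under Person.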

I attended a Matlab training seminar yesterday with the dual topics of “Advanced Matlab Programming” and “Distributed and Parallel Computing.” Of the two, the Advanced section was more interesting, though my original motivation for going was the parallel computing part. In the morning, I felt like it was going to be a waste because my Matlab programming skills are weak, and if my advisor had not strongly suggested I attend, I might’ve skipped it. I’m glad he did, because it was surprisingly enjoyable and I felt like it was right on my level. This might be because programming in Matlab isn’t especially hard or different from other programming languages and I know enough to get by already. Or it might be because Matlab is becoming a little more like Python.


From the most excellent xkcd:

I felt the exact same way when I first picked up python.  It was like finding the holy grail of programming languages.  To be able to just throw things into a list and access them without having to worry about casting.  To throw around functions like they were variables.  To weave functions out of thin air and watch them vanish when their usefulness had expired.  It was magic.

Of course, the honeymoon faded.  I still use python as a first resort.  As a programming language for exploring new ideas, it can’t be beaten.  Development time is ridiculously fast.  There has been effort to get the runtime up to snuff as well, but with much reluctance I’m forced to admit it doesn’t compare to C or even Java, may God have mercy on my soul.  Granted, it all depends on the application, blah blah blah.

Despite all that, I still love it.  It’s definitely first in my heart as far as programming languages go.

So I’ve been reading A New Kind of Science by Stephen Wolfram, the creator of Mathematica. It was hyped up big time back when he first wrote it, since he had gone silent for a number of years, hinting that he was about to do something big. So my middle little sister got me the book for Christmas (cuz she rocks) and I cracked it open a few times. It’s about 846 pages of text (yipes!) and then another 351 pages of notes. Quite daunting. So I put it down and have meant to pick it back up a thousand times. Today I was needing a diversion because a particular C++ issue was giving me fits.

 

In Chapter 2, Wolfram introduces a fairly simple 2-dimensional cellular automaton (one spatial dimension, one temporal dimension). The temporal dimension can be plotted as another spatial dimension, producing a nice little spreadsheet-style graph. Each cell of the graph can be considered a bit. Depending on whether the bit is set, the cell is either shaded or not. So the single line in the spatial dimension contains some initial setting. Let’s say there is one single bit set in the middle of the line, so it might look like this:

000000000010000000000
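To make the setup concrete, here is a small python sketch that evolves that starting line. It assumes Wolfram's standard numbering for these automata (each cell's next value is looked up from its left neighbor, itself, and its right neighbor, with the lookup table packed into a rule number from 0 to 255) and treats cells beyond the edges as 0. The choice of rule 90 and the names step and run are mine, picked only for illustration.

```python
def step(cells, rule):
    """Apply one step of an elementary cellular automaton rule (0-255)."""
    n = len(cells)
    new = []
    for i in range(n):
        left = cells[i - 1] if i > 0 else 0       # cells past the edge count as 0
        center = cells[i]
        right = cells[i + 1] if i < n - 1 else 0
        index = (left << 2) | (center << 1) | right
        new.append((rule >> index) & 1)           # look up the matching bit of the rule
    return new

def run(initial, rule, steps):
    """Return the initial row plus `steps` successive rows."""
    rows = [[int(c) for c in initial]]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return rows

# Each printed line is one time step, so time runs down the page.
for row in run("000000000010000000000", rule=90, steps=5):
    print("".join("#" if c else "." for c in row))
```

Swapping in a different rule number changes the picture completely, which is a large part of what the chapter is about.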


For one of my homework assignments, I have to solve words encrypted via a substitution cipher. These ciphers were insecure before computers came around, but they are still fun. If you’re unfamiliar with them, you’d probably recognize them as the cryptograms (“Cryptoquote”) in your local newspaper. In the simplest form, each letter is mapped to a different letter of the alphabet. A lot of people do these for fun and I know at least one person reading this does. The result is a run of text that might look like:

ov umy rfgs f nmg
MY DOG EATS A LOT

There are many ways of going about solving substitution ciphers, but a common way is by counting frequencies of characters. As most people know, e tends to occur more than other letters in most written English. The rest of the letters typically follow a pattern, as well, but that pattern degenerates once you leave the most common letters. The domain of the text you are examining is fundamentally important here. By domain I mean whether this text is from a newspaper, an IM, transcribed speech, etc. You can also look at bigrams, two-character sequences, to find the most commonly appearing sequences. In English, th appears much more often than ty, but ty still occurs. When trying to solve substitution ciphers this way, you are essentially matching the frequency distribution of the cipher text to the distribution of English and building a mapping from there. To put that a different way, you are matching up the most common letters or sequences in the garbled text with the most common real English letters or sequences.
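Here is a minimal python sketch of the counting step and the naive first-guess mapping. The letter ordering in ENGLISH_ORDER is a rough, commonly quoted approximation and an assumption on my part (the real order depends on the corpus and domain), and the function names are made up for illustration.

```python
from collections import Counter

# Approximate English letter order, most to least frequent (assumed; varies by corpus).
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def letter_counts(text):
    """Count each alphabetic character, ignoring case."""
    return Counter(c for c in text.lower() if c.isalpha())

def bigram_counts(text):
    """Count two-character sequences within words."""
    counts = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts

def guess_mapping(ciphertext):
    """Pair cipher letters with English letters by frequency rank."""
    ranked = [c for c, _ in letter_counts(ciphertext).most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))

cipher = "ov umy rfgs f nmg"
print(letter_counts(cipher).most_common(3))
print(guess_mapping(cipher))
```

On a text as short as the example the guessed mapping will be mostly wrong, since the counts are nearly flat; it only becomes a useful starting point with a few hundred characters of cipher text.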

Once the frequency counts have revealed the most common letters, many people proceed to deduction to eliminate the rest. Of course, this requires knowledge of English words directly, which has an impact on computational approaches to solving substitution ciphers automatically. I’m curious what approaches people have taken (if any) other than using a dictionary of English words and trying to find matches from there.


In a recent press release, kannuu is claiming to have revolutionized text entry. They claim that you can now perform text entry with just your thumb at the same speed as on a regular keyboard. Too good to be true? Here is their method, complete with Hype™.

“Advancing text entry exponentially, kannuu’s powerful and precise Partial Word Completion® technology enables users with a fail-safe text entry solution. The kannuu application appears on device, as a four-point diamond shape, comprised of the most popular letters in the database it is indexing, with the center kannuu logo leading to the next set of choices.”

They registered a trademark on the phrase “partial word completion”?? Blerg. Not only do they have an über lame web 2.0 name in lowercase, they gotta stop people from marketing a similar technology under their oh-so-not-original name. Why does this make me so angry? Anyhow, I’m running off sideways on a rant that is pretty insignificant.

The real point here is the potential for coolness. So here is the technology: you enter a letter, it presents you with a “diamond” shape and the most common letters or group of letters that follow the letter(s) you just entered. In this way, most of your everyday phrases will be right up at the top of the list of things you’re presented so you could potentially be entering words with fewer keystrokes and all with very little thumb movement. This could really revolutionize key input and maybe bring pocket computers to reality [source].

So here is what I think the technology is based on. A very common technique in language technologies is the use of n-grams. So they use a character-based n-gram model to predict the most common letter or letters that you would type next based on some corpus. This isn’t anything new. Cell phones already have a T9 input method that guesses the most common word based on the single letters you choose. This isn’t all that different. If they have done the interface well, that could be a serious improvement.
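To be clear about what I mean, here is a tiny python sketch of a character n-gram predictor. This is my guess at the general idea, not kannuu's actual implementation; the toy corpus, the function names, and the choice of four suggestions (to match the four-point diamond) are all assumptions.

```python
from collections import Counter, defaultdict

def train(corpus, n=3):
    """Count which character follows each (n-1)-character context (n >= 2)."""
    model = defaultdict(Counter)
    for i in range(len(corpus) - n + 1):
        context = corpus[i:i + n - 1]
        next_char = corpus[i + n - 1]
        model[context][next_char] += 1
    return model

def suggest(model, typed, n=3, k=4):
    """Return up to k most likely next characters given what was typed so far."""
    context = typed[-(n - 1):]
    return [c for c, _ in model.get(context, Counter()).most_common(k)]

model = train("the theme then there", n=3)
print(suggest(model, "th"))
print(suggest(model, "he"))
```

A real system would train on a large corpus and back off to shorter contexts when it has never seen the current one, but the lookup itself is no more complicated than this.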

If you’re interested in character-based n-gram models, I go into them in more depth after the jump.


Last year I worked on a project with my friend Israel Kloss called FreeAlert. The site is not-for-profit and was originally intended to help refugees entering the Washington, DC area find things they need for free. It now covers major metropolitan areas all across the United States and is intended to benefit everyone.

The idea is simple. Enter some keywords and get matching free items off of craigslist for your city sent to your cell phone. You can enter up to 5 sets of keywords and each set has exclusion terms. This makes it so that you can receive notices with the term computer but without the term desk. Israel just took the site out of private beta last week and it is currently in public beta mode.
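The matching rule can be sketched in a few lines of python. This is an illustration rather than the FreeAlert code: it assumes a listing must contain every keyword in a set and none of the exclusion terms, and the function name matches is made up.

```python
import re

def matches(title, keywords, exclusions=()):
    """True if title contains every keyword and none of the exclusion terms."""
    words = set(re.findall(r"[a-z0-9]+", title.lower()))
    has_all = all(k.lower() in words for k in keywords)
    has_excluded = any(x.lower() in words for x in exclusions)
    return has_all and not has_excluded

listings = ["Free computer monitor", "Free computer desk, must pick up", "Free couch"]
hits = [t for t in listings if matches(t, ["computer"], ["desk"])]
print(hits)
```

With the keyword computer and the exclusion term desk, only the first listing gets through, which is the behavior described above.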

It was an interesting project for me because it gave me the chance to work in python on some http and smtp protocol code. It also gave me the chance to work on processing xml and rss feeds. Definitely some cool stuff there and it has resulted in a spin-off that will probably be functioning fairly soon. Israel is one of those people with a lot of great ideas and he has the personality to inspire you with them. Plus he is also one of those rare people that actually care enough about the suffering of others to actually try to do something about it, which you just have to admire.

So, please, check it out and let us know how we can make it better.

About Me

Jason M. Adams

My name is Jason Adams and I work on opinion mining for a growing startup in Atlanta, GA.

Creative Commons License

This work by Jason M. Adams is licensed under a Creative Commons Attribution 3.0 License.
