So I decided to finally fart around with OpenCalais a little. There’s a nice video on the site that gives you an impression of what it is capable of, but like all videos about software, it’s also propaganda. Calais is basically Named Entity Recognition (NER) software that can be accessed via a web API. If you aren’t familiar with NER, it is basically the task of identifying the named entities in a body of text and labeling each one with a type. Named entities aren’t always proper nouns, but proper nouns are a good starting point. Examples would be: John Hancock (Person), New York (Place), and Apple (Organization). Whereas a regular NER system stops at entities like people, organizations, and places, Calais also recognizes relationships like corporate acquisitions, which means we get an extra layer of information: Acquisition(Microsoft, Yahoo!).
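To make the entity versus relationship distinction concrete, here is a minimal sketch in Python. The `Entity` and `Relation` classes are names I made up for illustration; they are not part of Calais or any Calais client library.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entity:
    """A span of text plus the type a NER system assigned to it."""
    text: str
    type: str


@dataclass(frozen=True)
class Relation:
    """A named relationship that ties several entities together."""
    name: str
    arguments: Tuple[Entity, ...]

    def __str__(self):
        args = ", ".join(entity.text for entity in self.arguments)
        return f"{self.name}({args})"


# Plain NER output: a flat list of typed entities.
entities = [
    Entity("John Hancock", "Person"),
    Entity("New York", "Place"),
    Entity("Apple", "Organization"),
]

# The extra layer: a relationship whose arguments are entities.
acquisition = Relation(
    "Acquisition",
    (Entity("Microsoft", "Organization"), Entity("Yahoo!", "Organization")),
)

print(acquisition)
```

The point of the second structure is that a relationship is more than two entities appearing in the same sentence: it names how they are connected.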
Calais is put out by Reuters, which has a long history of helping out the NLP and IR research communities with data sets. Being Reuters, the data sets are all newswire stuff, and Calais is produced in that spirit. Currently the relationships and named entities available reflect that bias, but the list is expanding and it is probably flexible enough for most domains. Their claim is that with each new release, there will be additional entities and relationships available. Also, the software is completely open source and free for commercial and private use. For this, I give Reuters props.
OpenCalais accepts requests via SOAP or HTTP POST, and you can take a look at their tutorials for exactly how to use it. After some very shallow digging on the googles, I found an open source project called python-calais, which is basically just a script that wraps some text, sends it to the Calais service, and then processes the output. The output is in RDF (Resource Description Framework), here serialized as XML, which is not very friendly to the human eye but is nice and powerful otherwise. The python-calais script uses an RDF library for Python, so you’ll need to download that if you don’t already have it.
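To give a feel for what processing RDF/XML involves, here is a rough sketch that uses only the Python standard library. Big caveat: the sample document and the `http://example.org/schema#` vocabulary are invented for illustration. This is not real Calais output, and it is not how python-calais does it (that script uses a proper RDF library). Only the `rdf:` namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # standard RDF namespace
EX_NS = "http://example.org/schema#"  # hypothetical vocabulary, NOT Calais's

# Hand-written sample document, not actual service output.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/schema#">
  <rdf:Description rdf:about="http://example.org/entity/1">
    <rdf:type rdf:resource="http://example.org/schema#Person"/>
    <ex:name>John Hancock</ex:name>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/entity/2">
    <rdf:type rdf:resource="http://example.org/schema#Organization"/>
    <ex:name>Apple</ex:name>
  </rdf:Description>
</rdf:RDF>"""


def extract_entities(rdf_xml):
    """Return (name, type) pairs for each rdf:Description in the document."""
    root = ET.fromstring(rdf_xml)
    entities = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        type_el = desc.find(f"{{{RDF_NS}}}type")
        name_el = desc.find(f"{{{EX_NS}}}name")
        if type_el is None or name_el is None:
            continue  # skip resources that are not simple named entities
        type_uri = type_el.get(f"{{{RDF_NS}}}resource")
        # Use the fragment after '#' as a short, readable type label.
        entities.append((name_el.text, type_uri.rsplit("#", 1)[-1]))
    return entities


for name, entity_type in extract_entities(SAMPLE):
    print(f"{name} ({entity_type})")
```

Treating RDF as plain XML like this only works for tidy documents. A real RDF library understands the graph model underneath, which is presumably why python-calais depends on one.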
Running it on my most popular post, you get the following output:
93B6642D-0D7C-37Ab-A92F-66Ebfef13C8D :: Recommender Systems (Industryterm)
0Dccb106-442A-3848-Bd0B-A388E73F4C8C :: Chris Sternal-Johnson (Person)
Aab0D16A-Ad5A-348A-A8Dc-58Cf59A1Bc15 :: Kristina Tikhova (Person)
42F476A0-2Fae-3F36-808D-803E4F620Ab0 :: Java (Technology)
6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)
4003D863-C7A6-3E6F-8E3C-0913Bf2F8242 :: National Aeronautics And Space Administration (Organization)
77D1Ceb3-9900-3Dd7-8351-F29408B21412 :: Carnegie Mellon University (Organization)
Ee58Ef4B-1C98-3F8B-Aff8-3Fd6E3D76A9E :: Wonderful Site (Industryterm)
8F12E551-A8F1-3705-866C-D44D1A6A54F4 :: Richard M. Hogg (Person)
Adee23De-B1B0-37Ad-9E20-1Fa8094F6D39 :: Steel (Industryterm)
0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)
2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)
136157D8-D62E-3C55-Ae67-3Ec182C2C703 :: Phil Barthram (Person)
B6A8Dbfa-Fd35-32Bb-9E05-A2811C480000 :: Mike Tan (Person)
Ed8B5Fe4-616A-36Ea-8C47-3Eea7C71Aee0 :: Ben Eastaugh (Person)
D3Bcba58-00Fc-33C5-9346-Dbf6A2441867 :: Machine Learning (Technology)
F17C3779-3810-3Ff9-A42D-75C3137F0F7F :: Modern English (Person)
38116E8D-F8B4-3D03-B0Ad-C9A24B888E61 :: Jason M. Adams (Person)
4386B07C-F6B8-3991-Af74-Ab11A951F0Ee :: David Petar Novakovic (Person)
Aa14303F-F9F0-31B8-Adff-3B9C68E0A9F1 :: Language Technologies Institute (Organization)
Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)
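Each line of that output follows the pattern "identifier :: name (type)", so it is easy to post-process. Here is a small sketch that groups names by type. The regular expression is my own guess at the format based on the lines above, not something taken from python-calais.

```python
import re
from collections import defaultdict

# Assumed line format: 36-character identifier, " :: ", name, " (Type)".
LINE_RE = re.compile(
    r"^(?P<id>[0-9A-Fa-f-]{36}) :: (?P<name>.+) \((?P<type>\w+)\)$"
)


def parse_line(line):
    """Return (identifier, name, type), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("id"), match.group("name"), match.group("type")


def group_by_type(lines):
    """Map each entity type to the list of names reported with that type."""
    grouped = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            _, name, entity_type = parsed
            grouped[entity_type].append(name)
    return dict(grouped)


# Three lines copied from the output above.
sample = [
    "42F476A0-2Fae-3F36-808D-803E4F620Ab0 :: Java (Technology)",
    "77D1Ceb3-9900-3Dd7-8351-F29408B21412 :: Carnegie Mellon University (Organization)",
    "0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)",
]
print(group_by_type(sample))
```

Grouping by type makes it easier to eyeball which categories are full of noise.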
As it picks up everything on the page, there is a lot included there that isn’t related to the post about Old English translation. Also, it picks up some weird so-called industry terms like “steel.” If you filter out just the text (manually), the output is a little more sensible:
6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)
Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)
0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)
2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)
136157D8-D62E-3C55-Ae67-3Ec182C2C703 :: Phil Barthram (Person)
F17C3779-3810-3Ff9-A42D-75C3137F0F7F :: Modern English (Person)
(The codes are unique identifiers.) Unfortunately, some important terms are still missed, like Old English. So it appears Calais has some growing to do, but it’s off to a good start. Part of the problem might be that the blog post is out of domain. I imagine with time, it will continue to improve. We’ll see.
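I filtered out the post text by hand, but that step could be automated before sending anything to the service. Here is a minimal sketch using the standard library's `html.parser`. It assumes the post body lives in a `<div class="entry">`, which is a made-up convention for this example; a real blog template will need a different selector, and the sample page is invented too.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect text inside <div class="entry"> and ignore everything else."""

    def __init__(self, target_class="entry"):  # "entry" is an assumed class name
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the post body
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1  # entering the post body

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_post_text(html):
    parser = PostTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented page: a post body plus a sidebar blogroll that should be dropped.
PAGE = """
<html><body>
<div class="entry"><p>Translating <em>Old English</em> with machine translation.</p></div>
<div class="sidebar"><ul><li>Mike Tan</li><li>Ben Eastaugh</li></ul></div>
</body></html>
"""

print(extract_post_text(PAGE))
```

With the sidebar stripped, blogroll names never reach the entity extractor in the first place.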




5 comments
31 May 2008 at 22:48:30
David Novakovic
Hey! Damn blogroll getting included. :)
Thanks for doing this write up dude, I’ve been looking at opencalais for a bit, the main reason I haven’t looked more seriously is I seem to recall the service having some restrictions for commercial use. :( Maybe I need to have a closer look.
31 May 2008 at 23:16:07
Jason Adams
There may be some restrictions for commercial use, I guess it just depends on what your intentions are. They do offer levels of the service you can pay for, which I guess guarantees your performance. I have noticed some slowness at times.
Also I made the mistake above of calling it open source — it’s not. The web service is free for private and commercial use, but the source is closed.
1 June 2008 at 09:01:02
Chris
I haven’t had a chance to look into opencalais yet, but I’m looking forward to it. It’s funny though. In the late 90s, during that amazing bubble period, IR was supposed to be the great and powerful NLP application. I got my first industry job at a small start-up doing IR in 2000. But it all seemed to crash and burn when the technology didn’t seem to go anywhere.
What does the average person need IR for? They need Google, sure, but do they need IR? The current money in IR is mostly in biomedical searching, especially gene searching. This is what LingPipe is heavy into. This LingPipe blog post on their work on Autism is especially relevant.
But what seems to have happened is that IR has become a highly specialized tool set marketed towards highly specific search problems. Its general utility has not yet been shown.
As a linguist, I love IR and find it inherently valuable. But, as a business professional, I have yet to see anyone impress me with an IR business model.
11 June 2008 at 10:35:52
Tom Tague
Hi, Tom Tague from Calais here.
First, thanks for taking note of Calais and taking the effort to experiment with it. In addition to what you’ve already explained - a few comments.
First, Calais is available for commercial or non-commercial use. Period. The default license gives you 40,000 transactions per day - and we’re happy to entertain requests for higher volumes if justified.
Second, I’d encourage every potential user of Calais to not stop at entity extraction. One of the most powerful features of Calais is the ability to expose relationships such as person:company, person:position, etc - many dozens of them in all.
And third, the Calais capabilities will continue to grow every month - as it has since release in January of this year. We’re always adding new entities and relationships and we and the Calais community are regularly deploying new tools to make Calais useful. For a sample of what’s available please visit the Gallery at http://www.opencalais.com. It has everything from code libraries to plugins for WordPress and Drupal.
Regards,
15 July 2008 at 17:19:47
Krista Thomas
Hi Jason,
Krista from Calais here. Wanted to let you know that Calais 2.1 is live.
In addition to our ongoing addition of new entities and vocabularies, the updated release features relevance ranking and integration with Yahoo Pipes.
BTW – we have also updated our browser plug-in Gnosis for Firefox 3 and created a version for IE. Go to ‘Tools’ on OpenCalais.com.
Thanks.