So I decided to finally fart around with OpenCalais a little. There’s a nice video on the site that gives you an impression of what it is capable of, but like all videos about software, it’s propaganda. Calais is basically Named Entity Recognition (NER) software that can be accessed via a web API. If you aren’t familiar with NER, it is the task of identifying the named things in a body of text and labeling each one with a type. Named entities aren’t always proper nouns, but proper nouns are a good starting point. Examples would be: John Hancock (Person), New York (Place), and Apple (Organization). Whereas a regular NER system stops at entities like people, organizations, and places, Calais also recognizes relationships between them, such as corporate acquisitions. That means we get an extra layer of information: Acquisition(Microsoft, Yahoo!).
Calais is put out by Reuters, which has a long history of helping out the NLP and IR research communities with data sets. Being Reuters, the data sets are all newswire stuff, and Calais is produced in that spirit. Currently the available relationships and named entities reflect that bias, but the list is expanding and it is probably flexible enough for most domains. Their claim is that each new release will add more entities and relationships. Also, the service is free for both commercial and private use. For this, I give Reuters props.
OpenCalais accepts requests over SOAP or HTTP POST, and you can take a look at their tutorials for exactly how to use it. After some very shallow digging on the googles, I found an open source project called python-calais, which is basically just a script that wraps some text, sends it to the Calais service, and then processes the output. The output is in RDF (Resource Description Framework), serialized here as XML, which is not very friendly to the human eye but is nice and powerful otherwise. The python-calais script uses an RDF library for Python, so you’ll need to download that if you don’t already have it.
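To give a feel for the "processes the output" step, here is a minimal sketch of pulling entity names and types out of an RDF/XML document. It is not the python-calais code and it does not talk to the Calais service. The sample document, the `example.org` URIs, the `ex:name` property, and the `extract_entities` helper are all made up for illustration, and the real Calais response will look different. It also uses the standard library's XML parser instead of a proper RDF library, which only works because the sample sticks to one simple RDF/XML layout.

```python
import xml.etree.ElementTree as ET

# The RDF namespace is standard; EX_NS is a hypothetical stand-in
# for whatever vocabulary the real service uses.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/calais-like/"

# Hand-written sample, NOT real Calais output.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/calais-like/">
  <rdf:Description rdf:about="http://example.org/entity/1">
    <rdf:type rdf:resource="http://example.org/type/Person"/>
    <ex:name>John Hancock</ex:name>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/entity/2">
    <rdf:type rdf:resource="http://example.org/type/Organization"/>
    <ex:name>Apple</ex:name>
  </rdf:Description>
</rdf:RDF>
"""


def extract_entities(rdf_xml):
    """Return (uri, name, type) tuples for each described resource."""
    root = ET.fromstring(rdf_xml)
    entities = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        type_el = desc.find(f"{{{RDF_NS}}}type")
        name_el = desc.find(f"{{{EX_NS}}}name")
        if type_el is None or name_el is None:
            continue  # skip resources that are not named entities
        type_uri = type_el.get(f"{{{RDF_NS}}}resource", "")
        # Use the last path segment of the type URI as a short label.
        entity_type = type_uri.rsplit("/", 1)[-1]
        uri = desc.get(f"{{{RDF_NS}}}about")
        entities.append((uri, name_el.text, entity_type))
    return entities


for uri, name, entity_type in extract_entities(SAMPLE):
    print(f"{uri} :: {name} ({entity_type})")
```

For anything beyond a toy like this, an actual RDF library is the better choice, since the same RDF graph can be written as XML in several different ways.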
Running python-calais on my most popular post gives the following output:
93B6642D-0D7C-37Ab-A92F-66Ebfef13C8D :: Recommender Systems (Industryterm)
0Dccb106-442A-3848-Bd0B-A388E73F4C8C :: Chris Sternal-Johnson (Person)
Aab0D16A-Ad5A-348A-A8Dc-58Cf59A1Bc15 :: Kristina Tikhova (Person)
42F476A0-2Fae-3F36-808D-803E4F620Ab0 :: Java (Technology)
6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)
4003D863-C7A6-3E6F-8E3C-0913Bf2F8242 :: National Aeronautics And Space Administration (Organization)
77D1Ceb3-9900-3Dd7-8351-F29408B21412 :: Carnegie Mellon University (Organization)
Ee58Ef4B-1C98-3F8B-Aff8-3Fd6E3D76A9E :: Wonderful Site (Industryterm)
8F12E551-A8F1-3705-866C-D44D1A6A54F4 :: Richard M. Hogg (Person)
Adee23De-B1B0-37Ad-9E20-1Fa8094F6D39 :: Steel (Industryterm)
0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)
2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)
136157D8-D62E-3C55-Ae67-3Ec182C2C703 :: Phil Barthram (Person)
B6A8Dbfa-Fd35-32Bb-9E05-A2811C480000 :: Mike Tan (Person)
Ed8B5Fe4-616A-36Ea-8C47-3Eea7C71Aee0 :: Ben Eastaugh (Person)
D3Bcba58-00Fc-33C5-9346-Dbf6A2441867 :: Machine Learning (Technology)
F17C3779-3810-3Ff9-A42D-75C3137F0F7F :: Modern English (Person)
38116E8D-F8B4-3D03-B0Ad-C9A24B888E61 :: Jason M. Adams (Person)
4386B07C-F6B8-3991-Af74-Ab11A951F0Ee :: David Petar Novakovic (Person)
Aa14303F-F9F0-31B8-Adff-3B9C68E0A9F1 :: Language Technologies Institute (Organization)
Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)
Because Calais is fed everything on the page, a lot of what comes back isn’t related to the post itself, which is about Old English translation. It also picks up some weird so-called industry terms like “steel.” If you strip the page down to just the text of the post (manually), the output is a little more sensible:
6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)
Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)
0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)
2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)
136157D8-D62E-3C55-Ae67-3Ec182C2C703 :: Phil Barthram (Person)
F17C3779-3810-3Ff9-A42D-75C3137F0F7F :: Modern English (Person)
(The codes are unique identifiers.) Unfortunately, some important terms are still missed, like Old English. So it appears Calais has some growing to do, but it’s off to a good start. Part of the problem might be that the blog post is out of domain: Calais is geared toward newswire, not posts about Old English. I imagine it will continue to improve with time. We’ll see.
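Since every line of the output above has the same shape (identifier, name, type), another way to cut down on noise is to parse the lines and drop the types you don’t care about. The sketch below does that for a few of the lines shown earlier. The `parse_line` and `group_by_type` helpers are my own names, not part of python-calais, and the line format is assumed to be exactly what is printed above.

```python
import re
from collections import defaultdict

# Matches lines like "<id> :: <name> (<type>)" as printed above.
LINE_RE = re.compile(r"^(?P<id>\S+) :: (?P<name>.+) \((?P<type>[^()]+)\)$")


def parse_line(line):
    """Split one output line into (id, name, type), or None if it doesn't match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("id"), match.group("name"), match.group("type")


def group_by_type(lines, drop_types=()):
    """Group entity names by type, skipping any type listed in drop_types."""
    grouped = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # ignore anything that isn't an entity line
        _entity_id, name, entity_type = parsed
        if entity_type in drop_types:
            continue
        grouped[entity_type].append(name)
    return dict(grouped)


sample = [
    "6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)",
    "Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)",
    "0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)",
    "2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)",
]

print(group_by_type(sample, drop_types={"Industryterm"}))
```

This only hides whole categories. It can’t recover terms Calais missed or fix an entity that was given the wrong type.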