If you follow news on the semantic web or new search engines, you may have heard of hakia. TechCrunch has done a short write-up about their new semantic search API. TechCrunch is brutally hard on startups that aren’t fully operational, so there is a lot of criticism in that article that I take with a grain of salt. I like seeing startups open their services with APIs and I think they deserve some benefit of the doubt. Maybe I’m looking at it the wrong way, though, and the fact that TechCrunch does make such a stink ensures the startup will correct the problem ASAP, rather than farting around for a while.

To get access to the API, you need to email licensing@hakia.com. They will grant you an API key (which supposedly is not required) and send you a couple of manuals documenting its usage. There is a more in-depth manual on using the API with .NET, but since I very rarely do anything with .NET, I turned to the manual on using it with scripting, which is very lightweight and not exhaustive. I had to guess a couple of parameters.

The API allows the following usages:

  1. web search - using hakia’s semantic search engine
  2. site search - limit to certain websites
  3. vertical search - search a particular database like PubMed or Wikipedia
  4. news search - search news articles
  5. news headlines - search for a geographic area
  6. hakia galleries - no idea what this is
  7. quotes - search for quotes
  8. cartoons - search for cartoons
  9. text summarization - summarize a body of text or webpage

In the works, they have:

  1. image search - search images
  2. video search - search videos
  3. text categorization - assign keywords to a body of text
  4. text characterization - expand keywords to related keywords for SEO or adwords
  5. text meaning representation - identifies what’s going on in the text (pdf for more info)

I don’t have any particular need for the web search parts at the moment, though there are certainly many uses I can think of, but the text summarization component looked like fun. In a nutshell, text summarization is the task of taking a body of text and reducing it to a short summary. Depending on the size of the text and your summarization needs, you should be able to scale just how much text is produced in the summary (the compression ratio determines how much). The underlying assumption is that not every sentence or phrase in a body of text is necessary to get the big picture. If you can find the sentences and phrases of primary importance and remove the rest, you can still convey the main idea of the text without all the details. Among the many problems you face in doing this is that you have to identify what the main idea of the text is and then find the sentences most closely tied to it. Not easy!
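To make the idea concrete, here is a minimal sketch of a frequency-based extractive summarizer in Python. This is not hakia's method (I have no idea what they do under the hood); it just illustrates the "keep the most important sentences" idea with a compression ratio. The function names, the tiny stopword list, and the scoring rule are all my own assumptions.

```python
import re
from collections import Counter

# A deliberately tiny stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "are", "was", "be"}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    # Lowercased alphabetic tokens with stopwords removed.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, ratio=0.15):
    """Keep roughly `ratio` of the sentences, chosen by average word frequency."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        if not tokens:
            return 0.0
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens)

    keep = max(1, round(len(sentences) * ratio))
    best = sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]), reverse=True)[:keep]
    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(best))
```

Calling `summarize(article_text, ratio=0.15)` keeps about 15% of the sentences. A scheme this simple has no real notion of "main idea", which is exactly the hard part described above.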

To test their summarizer, I fed it the last five articles I gave the smallest flying turd about on TechCrunch. So after skipping all the Yahoo! garbage, I have the following article headlines (which I’m not going to link to, to avoid spamming their trackback stuff):

  1. A Few Thoughts on How to Improve Google Friend Connect
  2. Lending Club Files For SEC Registration, Hopes To Resume Service
  3. Citi’s Mahaney: If Google Wants To Stay On Top, It Needs To Ramp Up Its Display Ad Revenues
  4. Justin.tv Captures Apartment Robbery
  5. Google Faces Off With Compete, Alexa, Comscore, Quantcast (And Soon Firefox)

My results using 15% compression were as follows (which took about 5 seconds to complete, in serial):

  1. A Few Thoughts on How to Improve Google Friend Connect Last week , I was lucky enough to be one of the first to embed Googles friend connect gadget in my blog . It was also interesting to read all the comments that people left at Googles comment widget . Why not ask people to upload pictures to Google Friend Connect to make sure people will recognize them ? Now you can Google their name to search for more information about them . I think that a profile info page , like the one we have on Google Reader , is a must have ! We all fell in love when Google came up with Gmail . Yet , I definitely think Google Friend Connect maybe one of the more popular services to debut this year . And how is Google going to make money off this someone please explain ? or just wait for Facebook Connect ? It is the first map mashup site ( not blog ) to get Friend Connect , and I see some good potential for the Friend Connect widgets for a site like mapdango . Friend Connect includes the ability to use OpenSocial widgets , so as more widgets are developed by Google and or third parties , sites that use Friend Connect will be able to include various types of functionality . Have you played with some of their other products lately , like Google Apps ? Somebody at Google needs to wake up . The distribution is Red Hat , but Hoelzle said Google doesnt use much of the distro . How does Friend Connect compare to MyBlogLog ? Much better organized than Google Friend Connect ! Yahoo with mybloglog has been once again surpassed by Google even that this widget is not yet powerfull as LoudAppeal . Both Google Friend Connect MyBlogLog helps you find new friends based on their blog reading taste . I think Friend Connect really has the potential to change some of our online habits .
  2. lending club files for sec registration , hopes to resume service Lending Club , the PP money lending site , has filed registration forms with the SEC . Lending Club originally launched as a Facebook app that allows users to lend money between themselves . They think big - the registration statement is for $ million !
  3. citis mahaney if google wants to stay on top , it needs to ramp up its display ad revenues It can sell display ads on its own sites , including YouTube , Google Images , and Google Maps . Mahaney estimates that Google can sell $ million worth of display ads ( including video ads ) on YouTube in . Google Images , Google Maps , Google Videos , and Google Finance could bring in another $ million , if fully plastered with ads . techcrunch japanese citi mahaney google Isnt the fact that google was clean and function at a time when the competitors were fully plastered with ads one of googles major advantages ? Therefore they should do the thing everyone else tried and has been getting their butts kicked , by google . I dont think google has Anything at all to worry about trust me The next big thing for google is local advertising .
  4. More details on the Justin . tv blog . Watch live video from chowdas channel on Justin . tv Economa y Empresas Una cmara de Justin . tv graba un ladrn - ALT techcrunch japanese justin tv Technically this is burglary , not robbery . Yes a burglary , not a robbery .
  5. google faces off with compete , alexa , comscore , quantcast ( and soon firefox ) And with the new Google Trends for Websites , Google has stopped short again . techcrunch japanese google compete alexa comscore quantcast firefox Google Trends for Websites - Missing Google Data - LGR Webmaster Blog There isnt any question of google using analytics data . I think google trends can be better than alexa , compete etc since google search is a majority traffic source for most websites . yes all of the compete , comscore , alexa does not provide the actual traffic data . Cookies ( which are called bacons ) similar to google analytics . I am glad google is giving this data out . How many people have iGoogle or google as their HP ? If I cant opt out , I wont use google analytics anymore . Looks like data from google websites is filtered from the results . No google . com , orkut . com etc Youd expect google to do it better . Anyone notice how google doesnt display traffic data for its own site ? You cant use it to check Google . com , though . I wonder how much of this data is from the Google Toolbar ?

I removed garbage characters that popped up, which is probably a Unicode problem with the way I was running it more than anything else. As a result, I accidentally nuked the numbers. Just pretend they’re there. My evaluation of each result is as follows:
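For what it's worth, here is a rough sketch of a cleanup step that would drop the garbage while leaving digits alone. It is not what I actually ran; the `clean` function is hypothetical, and it assumes the response arrives as UTF-8 bytes, which I have not verified.

```python
import unicodedata


def clean(raw: bytes) -> str:
    # Assumption: the bytes are UTF-8. Undecodable bytes become U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # Split accented letters into a base letter plus a combining mark.
    text = unicodedata.normalize("NFKD", text)
    # Keep printable ASCII (letters, digits, punctuation) plus newlines and tabs.
    return "".join(ch for ch in text
                   if ch.isascii() and (ch.isprintable() or ch in "\n\t"))
```

Curly apostrophes still disappear with this approach, but digits and dollar signs survive because they are plain ASCII.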

  1. it’s a bit disjointed, but touches on the highlights so I give it a thumbs up
  2. decent, thumbs up
  3. ok, thumbs up
  4. a lot of garbage, doesn’t really tell you what’s going on, so thumbs down
  5. also a lot of garbage, hard to find the real point, so thumbs down

So it gets a 60% acceptability rating from me (three thumbs up out of five) in my very unscientific field test. Smaller posts seem to be more susceptible to webpage garbage (menu bars, links to other stuff). A problem with summarizing blog posts is that you have comments thrown in with them. Comments tend to jump around in topic, and I suspect each jump registers with the summarizer as a change of topic, and therefore as something that has to be included in the summary. If I had removed the comments and passed it the post text only, I’m sure the results would improve. Doing that automatically isn’t easy either, so we’re back to square one. I’d probably need to resort to using the RSS feed, so actually it might not be that hard…
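Here is a rough sketch of the feed idea using only the Python standard library. It assumes a plain RSS 2.0 feed where each `item` carries the post in its `description` element; many feeds only put an excerpt there, so treat this as a starting point. The sample feed and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join("".join(parser.parts).split())


def posts_from_rss(xml_text):
    """Return (title, plain_text_body) pairs for every item in the feed."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext("description", default="")
        posts.append((title.strip(), strip_html(body)))
    return posts


# Hypothetical feed, only here to show the shape of the input.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example blog</title>
<item><title>First post</title>
<description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;.&lt;/p&gt;</description></item>
</channel></rss>"""

for title, body in posts_from_rss(SAMPLE_FEED):
    print(title, "->", body)
```

Each body could then be handed to the summarizer as text, with no comments, menu bars, or sidebar links mixed in.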

Anyhow, if you try it out, let me know what you think.