Posts Tagged ‘google’

This week has given me two new toys to play with, and you could probably say both were bought at the dollar store.  The first was Microsoft’s release of Rebranded Live, aka Bing.  Bing’s search results have been poor (for me), but not much poorer than Google’s.  Just enough poorer for me to see no reason to really switch, which is very bad for Microsoft.  There are neat little features, like pop-up feed links for blog posts and previews.  I like it, but it’s not much.  Where they shine is in image search, which already incorporates similar-image search (Google still has theirs in Labs).  Google Similar Images knocked my socks off at first, but then it just seemed like it should be renamed Google Identical Images.  Not much diversity.  Bing got this part right.  The images are similar, not identical.  There is a diverse collection and the navigation is great.  Kudos, Live Labs, for that one.  Is it perfect?  Nope, but it’s better than what I was using.

The next toy was Google Squared, which inspired this tweet right after I tried it:

Google Squared.  You had me at hello.

Further playing around with it convinced me that this would have been a nice tool to have when I was doing ridiculous term papers in high school.  Term papers about crap I didn’t care about.  Basically random stuff.  G^2 is great for that, but really not very helpful otherwise.  It was pretty awesome finding out the number of victims of 30 different serial killers all at once, though.  As quality improves (assuming it does), this could be pretty useful.  Quality has to get there though.  90% of the time spent using it is trial and error, trying to find something that works.  I was able to add some sorting algorithms to a square, but couldn’t find a single column to add that actually had something in it (that wasn’t absurd).  Wolfram|Alpha is still the winner in the knowledge engine department, methinks.

Some Google Squared Results

Perhaps you’ve heard of the latest brainchild of the Wunderkind Stephen Wolfram: Wolfram|Alpha.  Matthew Hurst nicknamed it Alphram today and I agree that’s a much better name.   Wolfram|Alpha (W|A henceforth) is not a search engine, it’s a knowledge engine.  It will compete with Google on a slice of traffic that Google really isn’t all that hot in for now: comparative question answering.  When you ask Google something like “How does the GDP of South Africa compare to China?” you hope you get back something relevant in the first few results (spoiler alert:  you don’t).  When you ask that of W|A, you get exactly what you’re looking for.  Beautiful.  W|A’s so-called natural language interface isn’t perfect, though.  You get a lot of flakiness from it until you start to recognize what works and what doesn’t.

Now let’s be honest.  How often do we search for that kind of thing?  Not very often.  I think that’s partly because Google is notoriously bad at it.  Once we start to get a handle on what W|A is capable of, I think people will start expecting more of their friendly neighborhood search giant.  Google claims to have a few tricks up its sleeve, but everything I’ve seen out of Google lately has been such a disappointment that I am deeply skeptical.  The new trick is called Google Squared and it returns search results in a spreadsheet format, breaking down the various facets of the things you are searching for.  In the demo, it shows stuff like roller coaster drop speeds, heights, etc. when you search for roller coasters.  You can add to the square and do some pretty nifty stuff.  TechCrunch claims this will kill W|A.  I think the two could be complementary.  Based on the demo, I expect W|A will return results of a higher calibre, but will miss out on a lot of queries because the knowledge is just missing.  Google Squared appears to be doing something fuzzier and will return results that might be really bad.  So while W|A just says it doesn’t know, Google Squared will let you pick through the junk to find the gem.  Google Squared is expected to launch later this month in Google Labs.

Many have said that where W|A will really compete is against Wikipedia and I am inclined to agree.  There are plenty of things I go to Wikipedia for now that I probably will switch over to W|A for, like populations of countries, size of Neptune’s moons, and so on.  Wikipedia still wins for more in-depth knowledge on a topic.  W|A also does some pretty cool stuff when you search for the definition of a word (use a query like “word kitten”).  You learn that kitten comes from Classical Latin, and entered English about 700 years ago.  You can find out a similar thing (and go further in depth for the etymology at least) using the American Heritage dictionary on dictionary.com, but W|A requires less digging.

And this brings me around to a key point with W|A.  It’s an awesome factoid answering service.  It does it well and it does it in a pretty way.  Stuff you can find in more depth elsewhere you can get quickly and easily, but only superficially via W|A.  There are links to more information, though, so you don’t lose much by relying on W|A — assuming it has knowledge about what you’re looking for.  You’re still going to be more likely to hit a brick wall with W|A.

And of course, since Wolfram developed Mathematica, W|A is backed by it.  Enter an equation and you get some really handy math info back.  Need to quickly know the derivative of a fairly complicated equation?  Presto.  Probably the most satisfying feeling I got today was from a query similar to “what is the area under x^4+3x^2+4 from 1 to 8?”  Let’s see you answer that, Google Squared.
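For the record, here is that area worked by hand. This is my own working with the power rule, assuming the query means the definite integral from 1 to 8, and it is not copied from W|A’s output:

```latex
\int_{1}^{8} \left(x^{4} + 3x^{2} + 4\right)\,dx
  = \left[\frac{x^{5}}{5} + x^{3} + 4x\right]_{1}^{8}
  = \left(\frac{32768}{5} + 512 + 32\right) - \left(\frac{1}{5} + 1 + 4\right)
  = \frac{35462}{5}
  = 7092.4
```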

Wolfram|Alpha sample results

There has been much ballyhoo in the blogosphere touting Google’s so-called foray into semantic search.  The blog post announcing the new feature doesn’t even mention the word semantics, but it does say it looks at associations and concepts related to your query.  I see no mention of tuples or anything of the sort and the suggested queries are the kind of thing that I would expect to come out of a background closer to document/query classification than semantic analysis.

Related search results for much ado about nothing

And the results are pretty meh.  Except for taming of the shrew, those results are no-brainers.  That’s query completion quality results.  Of course you can’t judge the whole system by one isolated example.

When PC World and a host of other pop tech media zines started toasting the entrance of Google to the semantic arena, I was excited to see some cool stuff.  Imagine my disappointment when I was underwhelmed not only by the quality of the results, but by the lack of novelty.  How long has that feature been there?  Seems like I’ve seen it for ages.  Maybe it got a technological face-lift (I guess that would be a face-lift on the inside), but it looks about the same as I remember it.  Plus, its placement at the bottom of the results page relegates it to search engine hell.

In summary:  boring.  My complaints are first and foremost with those elements of the blagoblag who over-hyped this.  Secondly, I am complaining to Google for not being better.  I am feeling demanding today.

Daniel’s post on it is worth reading.

It is bad journalism when an old news story is debunked and continues to be rehashed!  How sloppy!  Shame on you, Houston Chronicle!

Back around 2000, when Palem began thinking about the future of computer chip technology, power consumption wasn’t a big consideration. Only speed mattered.

But today, the energy consumed by information technology – a January news story likened the energy used in just two Google searches to boiling a kettle of tea – has become a major consideration.

Google debunked the results quite quickly after that article ran. Why is it acceptable to cite stories without checking on whether those stories are accurate? Isn’t this what we pay journalists for? I guess it’s too hard to check up on facts and instead we can just say there was a news story that reported it rather than making any claims about its correctness. Isn’t that what we have bloggers for?

Adwords Fail

Posted: 8 January 2009 in Uncategorized

These are the Gmail ads presented to me upon receiving an email with the subject “Nazi Israel.”  The text of the email contained no mention of Germany.

Adwords for "Nazi Israel"

If wishes were *

Posted: 21 November 2008 in Uncategorized

A phrase that has been bouncing around in my head lately takes the general form: “If wishes were X, we would all Y.” The problem I’ve been having is that I can’t remember what X and Y are. After a brief googlevestigation, I came up with two prime candidates:

If wishes were horses, we would all ride.

If wishes were fishes, we would all cast nets.

The phrasing of the second part is flexible. One of my random candidate Xs was “butterflies” (because I like die Schmetterling) so I looked into that as an option and happened upon (of unknown origin):

If wishes were butterflies, we would never see the sun. [source]

I like that.

I just spent the day with a couple of friends at the Google App Engine Hackathon in Atlanta.  We got to see Google Atlanta – or the public part of it anyway.  We weren’t permitted in the cafeteria or in the actual office area, which would have required signing non-disclosure agreements.  The office was about what I expected — the Google colors were in abundance, there were giant bouncing balls, and free drinks! (non-alcoholic)

We spent the day in a fairly hot conference room hacking away on a variety of projects.  We set up teams beforehand to work on projects that people proposed and I chose to work on a variation of a computing puzzles site, dubbed LangWar.  The idea is fairly simple in the early stages:  people submit programming puzzles and other people post their solutions in code form.  You can vote on which questions you like and which answers you like (or dislike).  You can also leave comments on questions and answers.  The result of the ratings is that the best questions will be ranked higher, in a method similar to Reddit, and the best answers will rise to the top based on the votes of users.

This is very similar to Stack Overflow, but different in that it is intended to be more of a puzzle solving site that pits implementations in different programming languages against each other.  It’s sort of a battle royale of programming languages – thus the name, LangWar.  It’s more of an enhanced version of Project Euler, where people can vote on the questions and answers.

In any case, it was a great chance to get my hands dirty in Google App Engine, meet some Atlanta Python coders, and have fun.  It’ll be interesting to see where LangWar goes from here, if it does go anywhere.

Microsoft just announced a research project called U Rank, which aims to do pretty much just that.  You rank search results, share with friends, blah blah blah.  Basically it’s Mahalo with Microsoft branding plus a few trinkets.  And it’s backed by Live Search so you can feel confident the baseline results will be easy to beat.  According to their website, here are some of the things you might do with U Rank:

  • Organize and annotate results: write notes to summarize important information under each URL
  • Lists: keep lists while you’re researching
  • Collaboration: Share URLs with friends
  • Recommendations: Tell your friends what you like
  • Multimedia results: Mix video and images with web results for added context
  • Ego-boosting: Make sure your home page is #1 (at least for you and your friends)
  • Easy to explore what your friends are sharing
  • Short-cuts: Move your favorite sites up; then put an ! in front of the query and go straight to the top result

The first two are great and are obviously missing from other services like Google, though Google has means of achieving those things other than their main search product (e.g. Google Bookmarks and Google Notebook).  Mixing in video and multimedia content is ok, I reckon.  Doing it manually, though?  Meh. The last one is probably useless.  It will be fun for the first couple days and then you’ll return to using bookmarks or whatever you do already to keep track of favorite sites.

The rest of the features are social networking junk.  The third is somewhat useful, but when you want to share something with friends, you normally want to push it on them and make sure they get it.  So you IM them the link or email it to them.  I’m not sure any link I’ve shared on a website or in Google Reader has ever been viewed by anyone else.  The fourth one is no different than sharing the link in my mind.  I wonder what the use case for that is.  If it’s automated recommendations, that might be cool, though I’ve never seen it done well except for on StumbleUpon, and that’s more because the UI makes it easy to ignore bad recommendations.

Like Mahalo and Wikia Search, the main features require critical mass in community size.  What good is sharing crap with my friends if none of my friends use U Rank?  And ego-boosting is a fun toy for about 2 minutes until you realize it’s just information masturbation.

“Hey, ma, I’m number one in the search results!  Look at it on your computer.”
“You’re still on page 16 for me, dear.”

And yes, I’ve done it.  Just search for “mendicant bug” on Wikia.

So minus critical mass, U Rank is Live Search with window dressing and some annotation tools.  The annotation tools will probably be worth checking out, but will it be enough to lure me away from the trusted Googles?  I highly doubt it.  

I’m just not sold on this whole user-driven search idea.  With web search you are searching a ginormous collection of transient documents.  Human annotation can’t keep up with it.  That’s the whole point!  In order for it to, you’d need a lot of people using the service, and you’re just not going to get that.  As it stands right now, Google gets about 60% of the market share for searches.  Yahoo comes in a very distant second with about 20% and the rest are floating at under 10%.  Mahalo and friends get less than 3% of the search engine market share to divvy up amongst themselves.  Is that enough to provide meaningful results?  Jason Calacanis might think so, but I’m skeptical.

I just finished reading about relevance-based language models for information retrieval (Lavrenko and Croft, 2001).  It’s an old paper, but some new stuff I was checking into relied on something else which relied on it — you know how the story goes.

In information retrieval, there are many retrieval models that have been used over the years.  Word on the street is that Google uses the vector space model, where the words in a document are represented as a vector.  Each word is its own dimension and the magnitude along that dimension is some weighting based on the number of times the word appears in that document.  A new query is converted into a vector in this space, and how well a document matches the query is measured by how close the two vectors are (commonly the cosine of the angle between them).  This glosses over a lot of details, but that’s the general idea.
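To make the vector space idea concrete, here is a minimal sketch.  The function names (`tf_vector`, `cosine_similarity`, `rank`) are my own, the weighting is plain term counts, and matching uses cosine similarity; a real engine would use something like tf-idf and an inverted index, so treat this as a toy rather than what Google actually does.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent text as a sparse vector: word -> raw count (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty query or document matches nothing
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, document) pairs, best match first."""
    q = tf_vector(query)
    scored = [(cosine_similarity(q, tf_vector(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "roller coaster drop speed",
    "gdp of south africa",
    "china gdp growth",
]
for score, doc in rank("gdp china", docs):
    print(round(score, 3), doc)
```

Documents sharing more query words (relative to their length) come out on top, and a document with no words in common scores zero.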

Another technique is to use language modeling.  A language model is built for each document and then the distance between the language model for a query and the language model for each document is used to rank the most relevant documents.  Again, a multitude of details have been glossed over.  The language modeling approach does a great job, and seems to be more theoretically grounded than the vector space model.  However, the vector space model does really well and there are many optimizations that make it easy to compute for huge datasets.
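Here is a rough sketch of the language modeling approach, under some assumptions of my own: unigram models, Jelinek-Mercer smoothing against the whole collection with an arbitrary mixing weight `lam`, and cross-entropy between the query model and each document model as the “distance” (which ranks the same as KL divergence for a fixed query).  All the names are hypothetical.

```python
import math
from collections import Counter


def unigram_counts(text):
    """Word counts for a piece of text (hypothetical helper)."""
    return Counter(text.lower().split())


def smoothed_prob(word, doc_counts, coll_counts, lam=0.5):
    """P(word | document), mixed with the collection model so unseen words are not zero."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    p_doc = doc_counts[word] / doc_len if doc_len else 0.0
    p_coll = coll_counts[word] / coll_len if coll_len else 0.0
    return lam * p_doc + (1 - lam) * p_coll


def cross_entropy(query, doc, coll_counts, lam=0.5):
    """Cross-entropy of the query model against the document model; lower is better."""
    q = unigram_counts(query)
    q_len = sum(q.values())
    d = unigram_counts(doc)
    total = 0.0
    for word, count in q.items():
        p = smoothed_prob(word, d, coll_counts, lam)
        if p == 0.0:
            return math.inf  # the word appears nowhere in the collection
        total -= (count / q_len) * math.log(p)
    return total


def rank_lm(query, docs, lam=0.5):
    """Return documents ordered from closest language model to farthest."""
    coll = Counter()
    for d in docs:
        coll.update(unigram_counts(d))
    return sorted(docs, key=lambda d: cross_entropy(query, d, coll, lam))


docs = [
    "roller coaster drop speed",
    "gdp of south africa",
    "china gdp growth",
]
print(rank_lm("gdp china", docs))
```

The smoothing is what keeps a document from being ruled out entirely just because it is missing one query word.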

One thing that retrieval models have tried to do is model the documents relevant to a query.  These are the documents you want to return when a person searches for something.  If you knew the exact set of these documents, your job would be done and information retrieval would be solved.  So it’s not an easy task, and it is further complicated by the fact that not everybody agrees which are the relevant documents for a particular query.  In probabilistic retrieval models this was done mainly with clunky heuristics that weren’t theoretically sound.  What Lavrenko and Croft (2001) did was create a formal approach to estimating the relevance model without any training data.  Sounds sweet, right?

What it amounts to is something called pseudo-relevance feedback.  Relevance feedback is the case where results are refined for queries based on labeled training data.  We know some relevant documents for certain queries, so we can use that to improve results for new queries.  Pseudo-relevance feedback requires no labeled data, but instead finds a way to simulate having the relevant documents.  Lavrenko and Croft did this by approximating the probability that a word would appear in the set of relevant documents by calculating the probability that the word would co-occur with the query words.
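As a rough illustration of that co-occurrence idea, here is one common way to write such an estimate: weight each document’s word distribution by how likely that document is to generate the query, then add everything up and normalize.  This is my own sketch, not necessarily the paper’s exact formulation; the uniform document prior, the smoothing weight `lam`, and the name `relevance_model` are all assumptions.

```python
from collections import Counter


def relevance_model(query, docs, lam=0.5):
    """Estimate P(word | relevant) from co-occurrence with the query words.

    Each document votes for its words, weighted by the (smoothed)
    likelihood of the query under that document's unigram model.
    """
    doc_counts = [Counter(d.lower().split()) for d in docs]
    coll = Counter()
    for counts in doc_counts:
        coll.update(counts)
    coll_len = sum(coll.values())
    if coll_len == 0:
        return {}  # nothing to estimate from

    def p(word, counts):
        # Document model mixed with the collection model (assumed smoothing).
        doc_len = sum(counts.values())
        p_doc = counts[word] / doc_len if doc_len else 0.0
        return lam * p_doc + (1 - lam) * coll[word] / coll_len

    q_words = query.lower().split()
    scores = Counter()
    for counts in doc_counts:
        q_lik = 1.0
        for q in q_words:
            q_lik *= p(q, counts)
        for word in coll:
            # A uniform prior over documents is a constant, so it is dropped.
            scores[word] += p(word, counts) * q_lik
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()} if total else {}


docs = ["jaguar car speed", "jaguar cat jungle", "python snake jungle"]
rm = relevance_model("jaguar cat", docs)
for word, prob in sorted(rm.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(prob, 3))
```

Words from documents that look like the query (here “jungle”) pick up more probability than words from documents that don’t (“snake”), even though nobody labeled anything as relevant.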

The handy part is you don’t have to do any pesky parameter estimation.  We just have to compute a bunch of probabilities, do some smoothing, and then hold our collective breath.  Check out the paper for details. 

References

Lavrenko, V. and Croft, W. B. 2001. Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New Orleans, Louisiana, United States). SIGIR ’01. ACM, New York, NY, 120-127. [pdf]

Cuil Fail

Posted: 28 July 2008 in Uncategorized

The blagoblag is abuzz with word of cuil, a search engine launched by some former Google engineers.  After many hours of downtime, I was able to check it out a short while ago.  The unfortunate result:  it blows.  It’s so bad that it’s as bad as your brain can comprehend.  Supposedly there are three times as many sites indexed as Google.  Well, that’s because they have not filtered any of the spam sites out.  A search for “mendicant bug” yields multiple spam copies of my blog and some wordpress category pages on the first ten pages.  My blog is conspicuously missing.  A search for my name also yields pathetic garbage.  Multiple other searches all led to the same thing:  spam pages get the highest rankings.

If your goal is searching for spam, then try out cuil.  You might get lucky and get infected by some nasty spyware.  Otherwise, don’t waste your time.