I just finished reading about relevance-based language models for information retrieval (Lavrenko and Croft, 2001). It’s an old paper, but some new stuff I was checking into relied on something else which relied on it — you know how the story goes.
In information retrieval, there are many retrieval models that have been used over the years. Word on the street is that Google uses the vector space model, where the words in a document are represented as a vector. Each word is its own dimension, and the magnitude along that dimension is some weighting based on the number of times the word appears in that document. A new query is converted into a vector in this space, and how well a document matches the query is measured by how close the two vectors are — typically the cosine of the angle between them. This glosses over a lot of details, but that’s the general idea.
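A toy sketch of the idea (raw term frequencies and cosine similarity; real systems use tf-idf weighting and lots of normalization, and the function names here are mine):

```python
from collections import Counter
from math import sqrt

def tf_vector(text):
    """Raw term-frequency 'vector' for a piece of text.

    Toy weighting: real systems use tf-idf, stemming, etc.
    """
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

docs = ["the cat sat on the mat", "dogs chase cats in the park"]
query = tf_vector("cat on mat")
# Rank documents by how close their vectors are to the query vector.
ranked = sorted(docs, key=lambda d: cosine(query, tf_vector(d)), reverse=True)
```

Note that with raw term frequencies, common words like “the” dominate the vectors — that’s exactly what idf weighting is meant to fix.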
Another technique is to use language modeling. A language model is built for each document and then the distance between the language model for a query and the language model for each document is used to rank the most relevant documents. Again, a multitude of details have been glossed over. The language modeling approach does a great job, and seems to be more theoretically grounded than the vector space model. However, the vector space model does really well and there are many optimizations that make it easy to compute for huge datasets.
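One common variant of the language-modeling approach is query likelihood: score each document by the probability that its (smoothed) unigram model generates the query. A minimal sketch, assuming simple linear-interpolation smoothing with the collection model (real systems often use Dirichlet smoothing instead; the function name and the `mu` weight are mine):

```python
from collections import Counter
import math

def query_likelihood(query, doc, collection, mu=0.5):
    """Log P(query | document model), with the document's unigram
    model linearly interpolated with the collection model so that
    unseen words don't zero out the score."""
    doc_tf = Counter(doc.split())
    coll_tf = Counter(collection.split())
    dlen, clen = sum(doc_tf.values()), sum(coll_tf.values())
    score = 0.0
    for w in query.split():
        p = mu * doc_tf[w] / dlen + (1 - mu) * coll_tf[w] / clen
        score += math.log(p) if p > 0 else float("-inf")
    return score

docs = ["the cat sat on the mat", "dogs chase cats in the park"]
collection = " ".join(docs)
# Higher (less negative) log-probability means a better match.
scores = [query_likelihood("cat mat", d, collection) for d in docs]
```

Ranking by this score is equivalent (up to constants) to ranking by the KL divergence between the query model and each document model, which is one way to read the “distance between language models” framing above.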
One thing that retrieval models have tried to do is model the documents relevant to a query. These are the documents you want to return when a person searches for something. If you knew the exact set of these documents, your job would be done and information retrieval would be solved. So it’s not an easy task, and it’s further complicated by the fact that not everybody agrees which documents are relevant for a particular query. In probabilistic retrieval models this was done mainly with clunky heuristics that weren’t theoretically sound. What Lavrenko and Croft (2001) did was create a formal approach to estimating the relevance model without any training data. Sounds sweet, right?
What it amounts to is something called pseudo-relevance feedback. Relevance feedback is the case where results are refined for queries based on labeled training data. We know some relevant documents for certain queries, so we can use that to improve results for new queries. Pseudo-relevance feedback requires no labeled data, but instead finds a way to simulate having the relevant documents. Lavrenko and Croft did this by approximating the probability that a word would appear in the set of relevant documents with the probability that the word co-occurs with the query words.
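The co-occurrence idea can be sketched in a few lines. This is a simplified take on the paper’s first estimation method: weight each document’s word distribution by how likely that document is to generate the query, then sum. I’m assuming a uniform prior over documents and linear smoothing, and the function name and `lam` weight are mine:

```python
from collections import Counter
import math

def rm1(query, docs, lam=0.5):
    """Estimate P(w | relevant) by averaging the documents' smoothed
    word distributions, each weighted by P(query | document).
    Simplified: uniform document prior, linear smoothing."""
    coll = Counter(w for d in docs for w in d.split())
    clen = sum(coll.values())

    def p_word(w, tf, dlen):
        # Smoothed unigram P(w | D): document model mixed with collection model.
        return lam * tf[w] / dlen + (1 - lam) * coll[w] / clen

    stats = []
    for d in docs:
        tf = Counter(d.split())
        dlen = sum(tf.values())
        # P(Q | D): how well this document's model explains the query.
        p_q = math.prod(p_word(w, tf, dlen) for w in query.split())
        stats.append((tf, dlen, p_q))

    # Unnormalized P(w | R) = sum over docs of P(w | D) * P(Q | D).
    rel = {w: sum(p_word(w, tf, dlen) * p_q for tf, dlen, p_q in stats)
           for w in coll}
    z = sum(rel.values())
    return {w: p / z for w, p in rel.items()}

rel = rm1("cat", ["the cat sat on the mat", "dogs chase cats in the park"])
```

In practice you’d run this over only the top-ranked documents for the query, then use the resulting distribution to re-rank or expand the query.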
The handy part is you don’t have to do any pesky parameter estimation. We just have to compute a bunch of probabilities, do some smoothing, and then hold our collective breath. Check out the paper for details.
Lavrenko, V. and Croft, W. B. 2001. Relevance-based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New Orleans, Louisiana, United States). SIGIR ’01. ACM, New York, NY, 120–127. [pdf]