Latent Dirichlet Allocation (LDA) is an unsupervised method of finding topics in a collection of documents. It posits a set of possible topics from which a subset are selected for each document. This selected mixture of topics represents the topics discussed in the document, and each word in the document is generated by this mixture. As a quick example, if we had a short document with the topics geology and astronomy:
The rover traveled many millions of miles through space to arrive at Mars. Once there, it collected soil samples and examined them to determine if liquid water had ever been present on the surface.
In this case, the topic astronomy is represented in red and geology in green. LDA finds these latent topics in an unsupervised fashion using the EM algorithm. EM is a two step process for estimating parameters in a statistical model. The nice thing about it is that it’s guaranteed to converge to a local maximum (not necessarily the global!). However, it can take a while to converge, depending on the size and nature of the data and model. While I was in school, EM was one of the most confusing concepts, and I’m still not 100% on it, but it makes a lot more sense now than it did before.
In the context of LDA, EM is basically doing two things. First, we come up with an idea about how the topics are distributed. Next, we look at the actual words and compute the probabilities in the model based on those hypothesized topics. Eventually we converge to a local “best” set of topics. These may not correspond to realistic topics, but they maximize the negative log probability of the model. Usually LDA does a pretty good job of finding explainable topics given a decent amount of data.
For more details about LDA, check out the paper by Blei et al (2003). LDA has been extended in a number of different directions since the original paper, so it’s essential reading if you’re doing any sort of topic modeling .
D.M. Blei, A.Y. Ng, and M.I. Jordan, “Latent dirichlet allocation,” The Journal of Machine Learning Research, vol. 3, 2003, pp. 993-1022. [pdf]