Bayesian topic models


June 11, 2007


Tom Griffiths


UC Berkeley


Electronic documents provide vast amounts of information, but that information needs to be organized in a way that lets people make use of it.
Topic models provide one way of approaching this problem, automatically identifying the “topics” that appear in a collection of documents, and indicating the extent to which each document reflects each topic. I will summarize the basic ideas behind one such model, Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003), and use this model to describe how tools from Bayesian statistics can be useful in statistical natural language processing. In particular, I will describe a simple algorithm for identifying topics from documents, based on Markov chain Monte Carlo, and show how this simple topic model can be extended to incorporate syntax, model the interests of authors, infer topic hierarchies, and pick out topically coherent segments of dialogue.
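To make the MCMC idea concrete, the following is a minimal sketch of a collapsed Gibbs sampler for LDA in the style of Griffiths and Steyvers (2004). The function name, the hyperparameters `alpha` and `beta`, and the toy corpus are illustrative assumptions, not details taken from the talk:

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_words, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA (minimal sketch).

    docs: list of documents, each a list of word ids in [0, n_words).
    Returns document-topic and topic-word count matrices.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))  # document-topic counts
    nkw = np.zeros((n_topics, n_words))    # topic-word counts
    nk = np.zeros(n_topics)                # total words assigned to each topic

    # Start from a random topic assignment for every word token.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this token's assignment from the counts...
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # ...then resample its topic from the conditional distribution
                # p(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + W*beta).
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```

After burn-in, the count matrices (plus the smoothing hyperparameters) give point estimates of the document-topic and topic-word distributions; each pass over the corpus resamples every word's topic given all the others, which is what makes this a Markov chain Monte Carlo method rather than a direct optimization.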


Tom Griffiths

Tom Griffiths is an Assistant Professor of Psychology and Cognitive Science at UC Berkeley, where he is also affiliated with the Computer Science Division, the Institute for Brain and Cognitive Sciences, and the Helen Wills Neuroscience Institute. His research explores connections between human and machine learning, using ideas from statistics and artificial intelligence to try to understand how people solve the challenging computational problems they encounter in everyday life. He received his PhD in Psychology from Stanford University in 2005, and was a faculty member in the Department of Cognitive and Linguistic Sciences at Brown University before moving to Berkeley.