Bayesian Methods for Unsupervised Language Learning


July 16, 2007


Sharon Goldwater




Unsupervised learning of linguistic structure is a difficult task. Frequently, standard techniques such as maximum-likelihood estimation yield poor results or are simply inappropriate (as when the class of models under consideration includes models of varying complexity). In this talk, I discuss how Bayesian statistical methods can be applied to the problem of unsupervised language learning to develop principled model-based systems and improve results. I first present some work on word segmentation, showing that maximum-likelihood estimation is inappropriate for this task and discussing a nonparametric Bayesian modeling solution. I then argue, using part-of-speech tagging as an example, that a Bayesian approach provides advantages even when maximum-likelihood (or maximum a posteriori) estimation is possible. I conclude by discussing some of the challenges that remain in pursuing a Bayesian approach to language learning.
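The claim that maximum-likelihood estimation is inappropriate for word segmentation can be illustrated with a toy sketch (this is a hypothetical example, not the talk's actual model): under a simple unigram word model with relative-frequency probabilities, the corpus likelihood is maximized by the degenerate analysis that leaves every utterance unsegmented, treating it as a single "word".

```python
from collections import Counter
from math import prod

def corpus_likelihood(segmented_utterances):
    """Likelihood of a corpus under a unigram word model whose word
    probabilities are set by maximum likelihood (relative frequency)."""
    tokens = [w for utt in segmented_utterances for w in utt]
    counts = Counter(tokens)
    total = len(tokens)
    return prod(counts[w] / total for w in tokens)

# Toy corpus of two utterances: "thedog" and "thecat".
segmented = [["the", "dog"], ["the", "cat"]]    # linguistically correct
unsegmented = [["thedog"], ["thecat"]]          # each utterance = one "word"

# MLE prefers the degenerate analysis:
# corpus_likelihood(unsegmented) = 1/4, corpus_likelihood(segmented) = 1/64.
```

Because adding boundaries can only split probability mass across more word types, the unsegmented corpus always scores at least as high, which is why the talk turns to nonparametric Bayesian priors that penalize such degenerate solutions.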


Sharon Goldwater

Sharon Goldwater is a postdoctoral scholar in the linguistics department at Stanford University, where she works with Dan Jurafsky, Chris Manning, and others in the Stanford natural language processing group. Her research focuses on unsupervised learning and computational modeling of language acquisition, particularly phonology and morphology. She completed her master's degree in computer science in 2005 and her Ph.D. in linguistics in 2006, both at Brown University. Prior to graduate school, she worked as a researcher in the Artificial Intelligence Laboratory at SRI International. In addition to her work on unsupervised language learning, she has published papers on machine translation, statistical parsing, and human-computer dialogue systems.