Active Learning and Annotation

July 31, 2015
Sanjoy Dasgupta | UC San Diego
MSR India Summer School 2015 on Machine Learning

The “active learning” model is motivated by scenarios in which it is easy to amass vast quantities of unlabeled data (images and videos off the web, speech signals from microphone recordings, and so on) but costly to obtain their labels. As in supervised learning, the goal is ultimately to learn a classifier. But the labels of the training points are hidden, and each of them can be revealed only at a cost. The idea is to query just a few labels that are especially informative about the decision boundary, and thereby to obtain an accurate classifier at significantly lower cost than regular supervised learning.

There are two distinct ways of conceptualizing active learning, which lead to rather different querying strategies. The first treats active learning as an efficient search through a hypothesis space of candidates, while the second has to do with exploiting cluster or neighborhood structure in data. This talk will show how each view leads to active learning algorithms that can be made efficient and practical, and have provable label complexity bounds that are in some cases exponentially lower than for supervised learning.
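As a rough illustration of the first view (search through a hypothesis space), consider a textbook toy case that is not taken from the talk itself: one-dimensional points whose labels are 0 to the left of an unknown threshold and 1 at or to the right of it, with no label noise. Under those assumptions, a binary search over the sorted unlabeled pool finds a threshold consistent with every point using roughly log2(n) label queries, whereas labeling the whole pool would take n. The names `active_learn_threshold` and `query_label` are hypothetical and chosen only for this sketch.

```python
import random


def active_learn_threshold(points, query_label):
    """Binary-search a pool of 1-D points for the label boundary.

    Assumption (for illustration only): labels are noise-free, 0 to the
    left of an unknown threshold and 1 at or to the right of it.
    Returns (threshold, number_of_label_queries).
    """
    xs = sorted(points)
    lo, hi = 0, len(xs)  # index of the first positive point lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1  # each call to query_label reveals one label at a cost
        if query_label(xs[mid]) == 1:
            hi = mid        # boundary is at mid or to its left
        else:
            lo = mid + 1    # boundary is strictly to the right of mid
    # If no point is positive, return a threshold that labels everything 0.
    threshold = xs[lo] if lo < len(xs) else float("inf")
    return threshold, queries


if __name__ == "__main__":
    random.seed(0)
    pool = [random.random() for _ in range(1000)]
    hidden_threshold = 0.37  # unknown to the learner; used only by the oracle

    def oracle(x):
        return int(x >= hidden_threshold)

    learned, used = active_learn_threshold(pool, oracle)
    errors = sum(int(x >= learned) != oracle(x) for x in pool)
    print(f"queried {used} of {len(pool)} labels, pool errors: {errors}")
```

This sketch covers only the simplest separable setting. With noisy labels or richer hypothesis classes, plain binary search is no longer adequate, and the gap between active and supervised label complexity can shrink or disappear.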

Speaker Details

“Sanjoy Dasgupta obtained his undergraduate degree from Harvard College in 1993. He worked for a year at Bell Laboratories and since then has been a graduate student at U.C. Berkeley, under the supervision of Umesh Vazirani. His thesis work, which will be completed this December, is motivated by the need for efficient and provably good learning algorithms for various commonly-used families of probability distributions.”