Machine Learning Day 2013 – Deep Learning (but not the kind you were thinking of); A Bayesian Information Criterion for Singular Models


October 18, 2013


Ran Gilad-Bachrach and Mathias Drton


Microsoft Research (MSR), University of Washington


Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic. As an alternative, we define the depth of hypotheses and show that generalization and robustness can be bounded as a function of this depth. Therefore, we suggest using the median hypothesis, which is a deep hypothesis, and present algorithms for approximating it.

One contribution of this work is an efficient method for approximating the Tukey median. The Tukey median, which is often used for data visualization and outlier detection, is a special case of the family of medians we define; however, computing it exactly takes time exponential in the dimension. Our algorithm approximates such medians in polynomial time while making weaker assumptions than those required by previous work.
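To make the notion of depth concrete, here is a minimal Python sketch of the Tukey (half-space) depth and a crude approximation of the Tukey median. It is not the algorithm presented in the talk: it uses the common random-directions heuristic, which only yields an upper bound on the true depth, and it restricts the median search to the data points themselves. The function names and the `n_directions` and `seed` parameters are illustrative assumptions.

```python
import numpy as np


def approx_tukey_depth(points, candidates, n_directions=500, seed=0):
    """Upper-bound the Tukey depth of each candidate w.r.t. `points`.

    The exact depth of x is the minimum, over all directions u, of the
    fraction of points y with u.y >= u.x. Here the minimum is taken over
    a finite set of random directions only (an assumption of this sketch),
    so the result can overestimate the true depth.
    """
    rng = np.random.default_rng(seed)
    dim = points.shape[1]
    directions = rng.standard_normal((n_directions, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    proj_points = points @ directions.T      # shape (n_points, n_directions)
    proj_cands = candidates @ directions.T   # shape (n_candidates, n_directions)

    # For each candidate and direction: fraction of points lying in the
    # closed half-space that starts at the candidate.
    in_halfspace = proj_points[None, :, :] >= proj_cands[:, None, :]
    fractions = in_halfspace.mean(axis=1)    # shape (n_candidates, n_directions)
    return fractions.min(axis=1)


def approx_tukey_median(points, n_directions=500, seed=0):
    """Return the data point with the largest approximate depth, and that depth."""
    depths = approx_tukey_depth(points, points, n_directions, seed)
    best = int(np.argmax(depths))
    return points[best], depths[best]
```

In this sketch a far-away outlier receives the smallest possible depth (only the outlier itself lies in some half-space starting at it), while points near the center of the cloud receive depths closer to one half, which is the sense in which a deep point is robust. The cost grows as the number of points times the number of candidates times the number of directions, so it is meant for small illustrations only.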

The presentation is based on joint work with Chris Burges.

The Bayesian Information Criterion (BIC) is a widely used model selection technique that is inspired by the large-sample asymptotic behavior of Bayesian approaches to model selection. In this talk we will consider such approximate Bayesian model choice for problems that involve models whose Fisher-information matrices may fail to be invertible along other competing submodels. When models are singular in this way, the penalty structure in BIC generally does not reflect the large-sample behavior of their Bayesian marginal likelihood. While large-sample theory for the marginal likelihood of singular models has been developed recently, the resulting approximations depend on the true parameter value and lead to a paradox of circular reasoning. Guided by examples such as determining the number of components of mixture models, the number of factors in latent factor models or the rank in reduced-rank regression, we propose a resolution to this paradox and give a practical extension of BIC for singular model selection problems.
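For orientation, the standard large-sample statements behind this abstract can be written as follows. The notation is an assumption of this note and does not come from the abstract: $n$ is the sample size, $\hat\theta_M$ the maximum likelihood estimate in model $M$, $d_M$ the dimension of $M$, and $L(M)$ the marginal likelihood of $M$. The extension of BIC proposed in the talk is not reproduced here.

```latex
% Usual BIC (Schwarz's form); for regular models it matches the
% log marginal likelihood up to a bounded error term.
\mathrm{BIC}(M) = \log L(\hat\theta_M) - \frac{d_M}{2} \log n ,
\qquad
\log L(M) = \mathrm{BIC}(M) + O_p(1) .

% Singular models: the penalty d_M / 2 is replaced by a learning
% coefficient \lambda \le d_M / 2 with multiplicity m \ge 1.
\log L(M) = \log L(\hat\theta_M) - \lambda \log n + (m - 1) \log\log n + O_p(1) .
```

In the singular case, $\lambda$ and $m$ depend on the true data-generating distribution, which is unknown when models are being compared; this is the circularity the abstract refers to.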

Joint work with Martyn Plummer.


Ran Gilad-Bachrach and Mathias Drton

Ran Gilad-Bachrach earned his Ph.D. from the Hebrew University of Jerusalem. He then joined Intel Research, where he led a small group of researchers studying applications of machine learning to improving distributed computing. Later he joined Bing to work on whole-page relevance. His latest position is at Microsoft Research, as a member of the machine learning group. His work focuses on machine learning, both theory and applications in the medical domain.

Mathias Drton is Professor of Statistics at the University of Washington (UW). A native of Germany, he received his PhD in Statistics from UW in 2004. After a postdoc in Mathematics at UC Berkeley, he spent seven years at the University of Chicago before returning to Seattle in 2012.