Machine Learning Day 2013 – Clustering: Probably Approximately Useless; Geometry Preserving Non-Linear Dimension Reduction


October 18, 2013


Rich Caruana and Marina Meila


Microsoft Research-Redmond, University of Washington


Clustering: Probably Approximately Useless?, Rich Caruana (MSR)

Clustering never seems to live up to the hype. To paraphrase the popular saying, clustering looks good in theory, yet often fails to deliver in practice. Why? You would think that something so simple and elegant as finding groups of similar items in data would be incredibly useful. Yet often it isn’t. The problem is that clustering rarely finds the groups *you* want, or expected, or that are most useful for the task at hand. There are so many good ways to cluster a dataset that the odds of coming up with the clustering that is best for what you’re doing are small. How do we fix this and make clustering more useful in practice? How do we make clustering do what you want, while still giving it the freedom to “do its own thing” and surprise us?

Geometry preserving non-linear dimension reduction, Marina Meila (UW Statistics)

In recent years, manifold learning has become increasingly popular as a tool for performing non-linear dimensionality reduction. This has led to the development of numerous algorithms of varying degrees of complexity that aim to recover the underlying low-dimensional parameters of the data using either local or global features.

It is also widely recognized that the low dimensional parametrizations will typically distort the geometric properties of the original data, like distances, angles, areas and so on. These distortions depend both on the data and on the algorithm used.

Building on the Laplacian Eigenmap framework, we propose a new paradigm that offers a guarantee, under reasonable assumptions, that *any* manifold learning algorithm will preserve the geometry of a data set. Our approach is based on augmenting the output of an algorithm with geometric information, embodied in the Riemannian metric of the manifold. The Riemannian metric allows us to compute geometric quantities (such as angle, length, or volume) for any coordinate system or embedding of the manifold. This geometric faithfulness, which is not guarante edfor most algorithms, allows us to define geometric measurements that are inde pendent of the algorithm used, and hence move seamlessly from one algorithm to another. In this work, we provide an algorithm for estimating the Riemannian metric from data and demonstrate the advantages of our approach in a variety of examples.

As an application of this new framework, we develop a new, principled, unsupervised to selecting the scale parameter in manifold learning, based on optimizing the geometric self-consistency w.r.t the scale.

This talk will not require any knowledge of advanced mathematics or manifold learning.

Joint work with Dominique Perrault-Joncas.


Rich Caruana and Marina Meila

Rich Caruana joined Microsoft in 2008 as a Senior Researcher in machine learning. Before joining Microsoft he was on the faculty in the Computer Science Department at Cornell University. Most of Rich’s research is in machine learning and data mining, and in the application of these to areas such as medical informatics, web ranking, and ecology.

Marina Meila is interested in computationally efficient statistical learning methods in domains with high dimensions, or with combinatorial or algebraic structure. Currently, she works in models over ranked data, non-linear dimension reduction, diffusions on graphs and compressed sensing. Some of her past contributions are in maximum entropy discrimination, graphical models, and the theoretical foundations of clustering. Dr. Meila received her MS from the Polytechnic Institute of Bucharest and her PhD from M.I.T. and is currently Associate Professor of Statistics and Adjunct Associate Professor of Computer Science at the University of Washington, Seattle.