Dealing With Data: Classification, Clustering and Ranking

April 11, 2006
Dengyong (Denny) Zhou | NEC Laboratories America, Inc.

This talk will focus on the following three pieces of our work:

How to utilize unlabeled data in classification? In many real-world machine learning problems, such as web categorization, only a few labeled examples are available because labeling requires human labor, whereas unlabeled data are far easier to obtain. So, naturally, one may wonder whether unlabeled data can be used in classification tasks. I will present a simple, powerful and mathematically clean approach to this problem, and show the good experimental results that a third party obtained with it on a number of machine learning benchmarks. Our approach is considered state of the art in the machine learning literature.
How to partition directed graphs like the Web? Spectral clustering for undirected graphs has been studied extensively since the mathematician Fiedler’s seminal work in the 1970s. The spectral method is so powerful that many people have attempted to generalize it to directed graphs. Among these attempts, the most popular is perhaps Jon Kleinberg’s HITS algorithm, which both ranks web pages and detects web communities. In 2003, Monika Henzinger, the former research director at Google Inc., listed this generalization issue as one of six algorithmic challenges in web search engines. I will show how we thoroughly solve this problem via Markov chain theory, and how our approach applies to real-world web data. The approach can be implemented in a few lines of Matlab code.
How to rank objects like images and texts? Link-based ranking has enjoyed huge success in web search engines. In practice, however, many types of data, such as texts and images, have no link structure and are instead modeled as vectors in Euclidean spaces. A principled way of ranking such data is to explore and exploit their intrinsic geometrical or manifold structure. I will show how we address this issue in a simple mathematical framework. Our approach has been widely used by different communities, from image retrieval to bioinformatics.
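
As a rough illustration of the first question (using unlabeled data in classification), here is a minimal sketch of one generic graph-based label-propagation scheme. The abstract does not spell out the algorithm, so everything below is an assumption made for illustration: the function name propagate_labels, the Gaussian affinity, the symmetric normalization, and the parameters sigma, alpha and n_iter are all hypothetical choices, not the speaker's stated method.

```python
import math


def propagate_labels(points, labels, n_classes, sigma=1.0, alpha=0.9, n_iter=200):
    """Spread a few known labels over a similarity graph (illustrative sketch).

    points : list of equal-length numeric tuples
    labels : list with a class index for labeled points, None for unlabeled
    """
    n = len(points)

    # Gaussian affinity between every pair of distinct points (assumed kernel).
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))

    # Symmetric normalization: S[i][j] = W[i][j] / sqrt(deg[i] * deg[j]).
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

    # One-hot matrix of the known labels; unlabeled rows stay all zero.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)] for i in range(n)]

    # Iterate F <- alpha * S * F + (1 - alpha) * Y; converges because alpha < 1.
    F = [row[:] for row in Y]
    for _ in range(n_iter):
        F = [
            [
                alpha * sum(S[i][k] * F[k][c] for k in range(n)) + (1.0 - alpha) * Y[i][c]
                for c in range(n_classes)
            ]
            for i in range(n)
        ]

    # Each point takes the class with the largest propagated score.
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]


# Two well-separated groups with a single labeled point in each.
points = [(0, 0), (0.5, 0), (0, 0.5), (5, 5), (5.5, 5), (5, 5.5)]
labels = [0, None, None, 1, None, None]
predicted = propagate_labels(points, labels, n_classes=2)
```

In this toy setup each unlabeled point should inherit the label of the labeled point in its own group. The same iteration can also be read as a ranking procedure if Y marks a query item instead of class labels, although that reading is likewise an assumption rather than something the abstract states.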

In addition, I will present some theoretical analysis of these approaches and discuss future extensions.

Speaker Details

Dengyong Zhou has been a research scientist in the Machine Learning Department of NEC Laboratories America (Princeton campus) since 2005. He is collaborating with Vladimir Vapnik on the theory and algorithms of statistical machine learning. Prior to joining NEC Labs, he was a research scientist at the Max Planck Institute for Biological Cybernetics (Tuebingen, Germany), in the Empirical Inference Department headed by Bernhard Schoelkopf. He obtained a Ph.D. in computer science from the Chinese Academy of Sciences and a Presidential Award of the Chinese Academy of Sciences in 2000. He is currently serving on the Program Committees of ICML 2006 and ECML 2006, and he has been or is a reviewer for NIPS 04, NIPS 05, IJCAI 05, the Journal of Machine Learning Research, IEEE Transactions on Information Theory and IEEE Transactions on Neural Networks. His main research interests include semi-supervised learning, active learning, ranking, spectral clustering, kernel methods, learning theory and statistics, and their applications to information retrieval and bioinformatics.