Large-Scale Learning by Data Compression
- Jian Zhang | Purdue University
An important challenge in machine learning is how to learn efficiently from massive training data sets, especially with limited storage and computing capability. In this talk we introduce an efficient learning method called “compressed classification”, which compresses the observations into a small number of pseudo-examples before classification. By analyzing the convergence rate of the risk, we show that classifiers learned from compressed data can closely approximate their non-compressed counterparts by effectively reducing the noise variance. We also present a hierarchical local grouping algorithm that iteratively splits observations into local groups, which leads to a faster compression process than its single-layer counterpart. Our experiments with simulated and real datasets show that the proposed local-grouping-based compression method can outperform several other compression methods, and achieves performance competitive with the non-compressed baseline while using much less learning time, for both small-scale and large-scale classification problems.
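To make the "compress then classify" idea concrete, here is a minimal sketch of one plausible instantiation. This is not the speaker's exact algorithm: the talk's hierarchical local grouping is stood in for by per-class k-means, each local group is averaged into a single pseudo-example (averaging is what reduces the noise variance), and a classifier is trained on the much smaller compressed set with group sizes as sample weights. All function and parameter names here are illustrative.

```python
# Sketch of compressed classification, assuming per-class k-means as the
# local grouping step (a stand-in for the hierarchical algorithm in the talk).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def compress(X, y, groups_per_class=50, seed=0):
    """Replace each class's observations with at most `groups_per_class`
    pseudo-examples (group means), weighted by group size."""
    Xc, yc, wc = [], [], []
    for label in np.unique(y):
        Xl = X[y == label]
        k = min(groups_per_class, len(Xl))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xl)
        for j in range(k):
            members = Xl[km.labels_ == j]
            if len(members) == 0:
                continue
            Xc.append(members.mean(axis=0))  # pseudo-example: group mean
            yc.append(label)
            wc.append(len(members))          # weight by number of originals
    return np.array(Xc), np.array(yc), np.array(wc)

# Usage: train on the compressed set instead of the full data.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=10000) > 0).astype(int)
Xc, yc, wc = compress(X, y, groups_per_class=50)
clf = LogisticRegression().fit(Xc, yc, sample_weight=wc)
print(clf.score(X, y))  # accuracy of the compressed-data classifier
```

The classifier sees only on the order of (classes × groups_per_class) weighted points rather than the full 10,000 observations, which is the source of the learning-time savings the abstract describes.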
Speaker Details
Jian Zhang is an Assistant Professor in the Department of Statistics at Purdue University. He obtained his PhD in computer science from Carnegie Mellon University, School of Computer Science, in 2006. His research interests are statistical machine learning and information retrieval.