Divide-and-Conquer and Statistical Inference for Big Data


October 25, 2012


Michael I. Jordan’s Keynote Speech on the 14th Computing in the 21st Century Conference co-hosted by Microsoft Research Asia, Nankai University and Tianjin University on October 25, 2012.

Abstract: Divide-and-conquer is a natural computational paradigm for approaching Big Data problems, particularly given recent developments in distributed and parallel computing, but some interesting challenges arise when applying divide-and-conquer algorithms to statistical inference problems. One interesting issue is that of obtaining confidence intervals in massive datasets. The bootstrap principle suggests resampling data to obtain fluctuations in the values of estimators, and thereby confidence intervals, but this is infeasible with massive data. Subsampling the data yields fluctuations on the wrong scale, which have to be corrected to provide calibrated statistical inferences. The new procedure, the “bag of little bootstraps,” circumvents this problem, inheriting the favorable theoretical properties of the bootstrap but also having a much more favorable computational profile. Another issue is the problem of large-scale matrix completion. Here divide-and-conquer is a natural heuristic that works well in practice, but new theoretical problems arise when attempting to characterize the statistical performance of divide-and-conquer algorithms. Here the theoretical support is provided by concentration theorems for random matrices, and a new approach to this problem bases on Stein’s method.