Big Data Analytics

Established: October 18, 2012

We conduct research in the area of algorithms and systems for processing massive amounts of data. Our work aims at pushing the boundary of computer science in the area of algorithms and systems for large-scale computations. Our mission is to achieve major technological breakthroughs in order to facilitate new systems and services relying on efficient processing of big data.

Research Areas

Database queries – How can we efficiently resolve database queries on massive amounts of input data? Here the input data may be presented in the form of a distributed data stream.


Machine learning – How can we efficiently solve large-scale machine learning problems? Here the input data may be massive, stored in a distributed cluster of machines.


Distributed computing – How can we efficiently solve large-scale optimization problems in distributed computing environments? For example, how can we efficiently solve large-scale combinatorial problems, e.g. processing of large scale graphs?


Big Data Analytics Lecture Series

Small Summaries for Big Data

Graham Cormode – AT&T Research

July 2nd, 2012, 16:00-17:00, Microsoft Research Cambridge, Jasmine Room

Abstract: In dealing with big data, we often need to look at a small summary to get the big picture. Over recent years, many new techniques have been developed which allow important properties of large distributions to be extracted from compact and easy-to-build summaries. In this talk, I’ll highlight some examples of different types of summarization: sampling, sketching, and special-purpose. Then I’ll outline the road ahead for further development of such summaries.

About the speaker: Homepage

Testing Properties of Discrete Distributions

Tugkan Batu – London School of Economics

May 15th, 2012, 16:00-17:00, Microsoft Research Cambridge, Lecture Theatre Small

Abstract: In this talk, we will survey some algorithmic results for several fundamental statistical inference tasks. The algorithms are given access only to i.i.d. samples from the discrete input distributions and make inferences based on these samples. The main focus of this research is the sample complexity of each task as a function of the domain size for the underlying discrete probability distributions. The inference tasks studied include (i) similarity to a fixed distribution (i.e., goodness-of-fit); (ii) similarity between two distributions (i.e., homogeneity); (iii) independence of joint distributions; and (iv) entropy estimation. For each of these tasks, an algorithm with sublinear sample complexity is presented (e.g., a goodness-of-fit test on a discrete domain of size $n$ is shown to require $O(sqrt{n}polylog(n))$ samples). Given certain extra information on the distributions (such as the distribution is monotone or unimodal with respect to a fixed total order on the domain), the sample complexity of these tasks become polylogarithmic in the domain size.

About the speaker: Homepage

Streaming Pattern Matching

Ely Porat – Bar-Ilan University, Israel

May 1st, 2012, 15:00-16:00, Microsoft Research Cambridge, Primrose Room

Abstract: We present a fully online randomized algorithm for the classical pattern matching problem that uses merely O(log m) space, breaking the O(m) barrier that held for this problem for a long time. Our method can be used as a tool in many practical applications, including monitoring Internet traffic and firewall applications. In our online model we first receive the pattern P of size m and preprocess it. After the preprocessing phase, the characters of the text T of size n arrive one at a time in an online fashion. For each index of the text input we indicate whether the pattern matches the text at that location index or not. Clearly, for index i, an indication can only be given once all characters from index i till index i+m-1 have arrived. Our goal is to provide such answers while using minimal space, and while spending as little time as possible on each character (time and space which are in O(poly(log n))).

About the speaker: Homepage

Basic Probabilistic Load Balancing Mechanisms

Tom Friedetzky – Durham University

April 24th, 2012, 15:00-16:00, Microsoft Research Cambridge, Lecture Theatre Large

Abstract: In this talk we will discuss a number of basic yet useful load balancing mechanisms for parallel and distributed computations, based on random allocation (“balls into bins”) and randomised work stealing. The focus will be on approaches that lend themselves to thorough mathematical analysis but that, due to their simplicity and general easiness on assumptions, may be considered to be good, all-purpose performers. The talk will mention theoretical results and occasionally hint at proof strategies but most parts will be accessible to a general audience.

About the speaker: Homepage

Large Scale Semidefinite Programming

Alexandre d’Aspremont – Ecole Polytechnique, France

April 17th, 2012, 15:00-16:00, Microsoft Research Cambridge, Primrose Room

Abstract: The talk will start by a brief introduction on semidefinite programming. It will discuss some recent advances in large scale semidefinite programming solvers, detailing in particular stochastic smoothing techniques for the maximum eigenvalue function.

Joint work with Noureddine El Karoui at U.C. Berkeley.

About the speaker: Homepage

The Online Approach to Machine Learning

Nicolo Cesa-Bianchi – Universita degli Studi di Milano

February 28th, 2012, 14:00-15:00, Microsoft Research Cambridge, Large Lecture Theatre

Abstract: Online learning has become a standard tool in machine learning and large-scale data analysis. Learning is viewed as a repeated game between an adaptive agent and an ever-changing environment. Within this simple paradigm, one can model a variety of sequential decision tasks simply by specifying the interaction protocol and the resource constraints on the agent. In the talk we will first highlight some of the main features of online learning algorithms, such as simplicity, locality, scalability, and robustness. Then, we will describe algorithmic applications to specific learning scenarios (partial feedback, attribute-efficient, multitask, semi-supervised, transductive, and more) showing how diverse settings can be effectively captured within the same conceptual framework.

About the speaker: Nicolo Cesa-Bianchi

Logic and Analysis Issues in Web-based Data Integration

Michael Benedikt – Oxford University

January 31st, 2012, 14:00-15:00, Microsoft Research Cambridge, Primrose Room

Abstract: A prime driver for much database research over the past decade has been providing unified structured (relational) query interfaces on top of web-based datasources. There are a range of issues that come up in doing this I will talk try to give an idea of a few of them, focusing on several we have worked on at Oxford: analysis of dynamic access plans, answerability of queries on the Web, and analysis of web pages to support query answering.

This is joint work with Pierre Bourhis, Tim Furche, Georg Gottlob, Andreas Savvides, and Pierre Senellart.



  • Srikanta Tirthapura, Iowa State University, to join May 2013
  • Ely Porat, Bar-Ilan University, Israel, June 2012


  • Nan Li, University of California Santa Barbara, 2012/13
  • Charalampos Tsourakakis, CMU, 2012
  • Zengfeng Huang, Hong Kong University of Science and Technology, 2012
  • Alan Roytman, UCLA, 2012
  • Zhenming Liu, Harward University, now a post-doc at Princeton University, 2011
  • Dinkar Vasudevan, EPFL, 2008


  • Microsoft Online Services Division