Big Data Analytics 2013



We are witnessing rapid development of research and technology for efficient processing of big data. There is a surge of commercial and open-source platforms for big data analytics, including platforms for querying massive datasets, batch processing, real-time analytics, streaming computations, iterative computations, graph data processing, and distributed machine learning. There have been remarkable achievements both in the design of scalable and efficient algorithms for processing massive amounts of data and in the architecture of systems and infrastructure. The goal of this workshop is to bring together researchers and system architects to discuss and identify the most important and challenging directions for advancing algorithms and systems for big data. The topics of the workshop include, but are not limited to, computation and storage platforms, querying of massive datasets, sketches, streaming, scaling up distributed machine learning, iterative computations, and large-scale graph processing.







Day 1

8:00 – 8:45
Continental breakfast

8:45 – 9:00
Welcome note

9:00 – 10:00 | Abstract | Slides | Video
Keynote talk, Gerhard Weikum, Max Planck Institute for Informatics
From Text to Entities and from Entities to Insight: a Perspective on Unstructured Big Data

10:00 – 10:30 | Abstract
Anastasia Ailamaki, EPFL
Querying and Exploring Big Brain Data

10:30 – 11:00
Refreshments

Programming and Computation Models

11:00 – 11:30 | Abstract | Slides | Video
Volker Markl, TU-Berlin
Big Data with Stratosphere

11:30 – 12:00 | Abstract
Jingren Zhou, Microsoft Corporation
SCOPE: Parallel Databases Meet MapReduce

12:00 – 13:00
Lunch

Algorithms, Infrastructure, and Platforms

13:00 – 13:30 | Abstract | Slides | Video
Flavio P. Junqueira, Microsoft Research
Online Data Processing with S4 and Omid

13:30 – 14:00 | Abstract | Slides | Video
Thomas Karagiannis, Microsoft Research
Predictable Data Centers

14:00 – 14:30 | Abstract
Sergei Vassilvitskii, Google
From Terabytes to Megabytes: Finding the Needle by Shrinking the Haystack

14:30 – 15:00 | Abstract | Slides | Video
Frank McSherry, Microsoft Research
Incremental, Iterative, and Interactive Computation using Differential Dataflow

15:00 – 16:00 | See Demos and Posters Tab
Refreshments + Demo and Poster Session

Large-scale Graph Analytics

16:00 – 16:30 | Abstract | Slides
Guy Blelloch, Carnegie Mellon University
Big Data on Small Machines

16:30 – 17:00 | Abstract | Slides
Sudipto Guha, University of Pennsylvania
Graphs and Linear Measurements

17:00 – 17:30 | Abstract | Slides
George Karypis, University of Minnesota
Partitioning & Clustering Big Graphs

17:30 – 18:00 | Abstract | Slides
Stefano Leonardi, Sapienza University of Rome
Online Team Formation in Social Networks

19:00 – 21:00
Workshop reception and dinner at Emmanuel College

Day 2

8:30 – 9:00
Coffee and pastries

9:00 – 10:00 | Abstract
Keynote talk, Surajit Chaudhuri, Microsoft Research
Big Data and Enterprise Analytics

10:00 – 10:30 | Abstract | Slides | Video
Graham Cormode, University of Warwick
Streaming Verification of Outsourced Computation

10:30 – 11:00
Refreshments + Group photo

11:00 – 12:00 | Panel
Graham Cormode, University of Warwick
Surajit Chaudhuri, Microsoft Research
Sudipto Guha, University of Pennsylvania
Sergei Vassilvitskii, Google
Jingren Zhou, Microsoft Corporation
Big Data Analytics: A Happy Marriage of Systems and Theory?

12:00 – 13:30
Lunch

13:30 – 14:30 | Abstract
Keynote talk, Sanjeev Khanna, University of Pennsylvania
Fast Algorithms for Perfect Matchings in Regular Bipartite Graphs

Large-Scale Graph Analytics Cont’d

14:30 – 15:00 | Abstract | Slides
Aleksander Madry, EPFL
Cuts, Trees, and Electrical Flows

15:00 – 15:30 | Abstract | Slides | Video
Srikanta Tirthapura, Iowa State University
Neighborhood Sampling for Estimating Local Properties on a Graph Stream

15:30 – 16:00
Refreshments

Streaming and Massive Distributed Data

16:00 – 16:30 | Abstract | Video
Amit Chakrabarti, Dartmouth College
What Can’t We Compute on Data Streams?

16:30 – 17:00 | Abstract | Slides | Video
Minos Garofalakis, Technical University of Crete
Querying Big, Dynamic, Distributed Data

17:00 – 17:30 | Abstract | Slides | Video
Ke Yi, Hong Kong University of Science and Technology
Mergeable Summaries

17:30 – 18:00
Closing session

Keynote Speakers

Surajit Chaudhuri

Distinguished Scientist and Managing Director, Microsoft Research

Web Link

Surajit Chaudhuri, Distinguished Scientist and a Managing Director at Microsoft Research, will discuss the key secular trends that characterize the field of Big Data with respect to enterprise analytics. He will describe some of the open challenges for enterprise analytics in the context of Big Data.

Sanjeev Khanna

Professor, University of Pennsylvania

Web Link

Sanjeev Khanna is the Henry Salvatori Professor of Computer and Information Science at the University of Pennsylvania. His primary research interests are in the design and analysis of algorithms for combinatorial optimisation and in complexity theory. His research has been supported by the National Science Foundation, an Alfred P. Sloan Fellowship, a Guggenheim Fellowship, and an IBM Faculty Award. Sanjeev serves on the editorial boards of SICOMP, ACM TALG, and Foundations and Trends in Theoretical Computer Science, and has served as an area editor for the Encyclopaedia of Algorithms.

Gerhard Weikum

Scientific Director, Max Planck Institute for Informatics

Web Link

Gerhard Weikum is a Scientific Director at the Max Planck Institute for Informatics in Saarbruecken, Germany, and also an Adjunct Professor at Saarland University. His research spans transactional and distributed systems, self-tuning database systems, DB&IR integration, and automatic knowledge harvesting from Web and text sources. He co-authored a comprehensive textbook on transactional systems, received the VLDB 10-Year Award for his work on automatic DB tuning, and is one of the creators of the Yago knowledge base. Gerhard Weikum is an ACM Fellow and a member of the German Academy of Science and Engineering. He received a Google Focused Research Award in 2010, and the ACM SIGMOD Contributions Award in 2011.


Surajit Chaudhuri

Surajit Chaudhuri, Distinguished Scientist and a Managing Director at Microsoft Research, will discuss the key secular trends that characterize the field of Big Data with respect to enterprise analytics. He will describe some of the open challenges for enterprise analytics in the context of Big Data.

Sudipto Guha

Sudipto Guha has been an Associate Professor in the Department of Computer and Information Sciences at the University of Pennsylvania since Fall 2001. He completed his Ph.D. in 2000 at Stanford University, working on approximation algorithms. He is a recipient of the NSF CAREER award in 2007 and an Alfred P. Sloan Foundation fellowship.

Sergei Vassilvitskii

Sergei Vassilvitskii is a Research Scientist at Google and an Adjunct Assistant Professor of Computer Science at Columbia University. He received his M.S. and Ph.D. in Computer Science from Stanford University and a B.S. from Cornell University. He works on problems at the intersection of economics, data mining and computer science.

Jingren Zhou

Jingren Zhou is a Partner Development Manager in Bing's Infrastructure Team. Jingren manages a team developing a cloud-scale distributed computation system, called SCOPE, targeted at massive data analysis over tens of thousands of machines at Microsoft Bing. The SCOPE system combines benefits from both traditional parallel databases and MapReduce execution engines to allow easy programmability and deliver massive scalability and high performance through advanced optimization. The system processes petabytes of data daily for a variety of data analysis and data mining applications, powering Bing and other online services at Microsoft. Before he joined Bing Search in 2008, Jingren was a researcher in the database research group at Microsoft Research. He has published many articles in premier database conferences and journals. Jingren received a Ph.D. in Computer Science from Columbia University.

Graham Cormode

Graham Cormode is a Professor of Computer Science at the University of Warwick, UK. He works on topics around big data, streaming analytics, privacy, and data mining. From 2006 to 2013 he was a researcher at AT&T Labs, and prior to that he was at Bell Labs.



  • Anastasia Ailamaki, EPFL
  • Dan Alistarh, MIT
  • Hitesh Ballani, Microsoft Research
  • Alexandre de Baynast, Microsoft ATL Europe
  • Guy Blelloch, Carnegie Mellon University
  • Lucas Bordeaux, Microsoft Research
  • Florian Bourse, ENS Paris
  • John Bronskill, Microsoft Research
  • Miguel Castro, Microsoft Research
  • Nuno Cerquiera, Microsoft (Data at Skype)
  • Amit Chakrabarti, Dartmouth College
  • Stephen Clark, University of Cambridge
  • Raphael Clifford, Bristol University
  • Graham Cormode, University of Warwick
  • Paolo Costa, Microsoft Research
  • Artur Czumaj, University of Warwick
  • Valentin Dalibard, University of Cambridge
  • Bryan Dove, Microsoft (Data at Skype)
  • Moez Draief, Imperial College
  • Minos Garofalakis, Technical University of Crete
  • Richard Gibbens, University of Cambridge
  • Christos Gkantsidis, Microsoft Research
  • Thore Graepel, Microsoft Research
  • Peter Grindrod, University of Reading
  • Sudipto Guha, University of Pennsylvania
  • Mehdi Hosseini, University of Cambridge
  • Markus Jalsenius, Bristol University
  • Flavio Paiva Junqueira, Microsoft Research
  • Thomas Karagiannis, Microsoft Research
  • George Karypis, University of Minnesota
  • Ian Kash, Microsoft Research
  • Frank Kelly, University of Cambridge
  • Peter Key, Microsoft Research
  • Pushmeet Kohli, Microsoft Research
  • Philipp Kranen, MSR ATL Europe
  • Sergey Legtchenko, Microsoft Research
  • Marc Lelarge, INRIA-ENS Paris
  • Stefano Leonardi, Sapienza University of Rome
  • Aleksander Madry, EPFL
  • Volker Markl, TU Berlin
  • Laurent Massoulie, INRIA
  • Frank McSherry, Microsoft Research
  • Derek Murray, Microsoft Research
  • Dushyanth Narayanan, Microsoft Research
  • Karthik Nilakant, University of Cambridge
  • James Norris, University of Cambridge
  • Alexandre Proutiere, KTH Stockholm
  • Ant Rowstron, Microsoft Research
  • Benjamin Sach, University of Warwick
  • Richard Samworth, University of Cambridge
  • Eirini Spyropoulou, University of Bristol
  • Maxim Sviridenko, University of Warwick
  • Don Syme, Microsoft Research
  • Srikanta Tirthapura, Iowa State University
  • Andy Twigg, University of Oxford
  • Sergei Vassilvitskii, Google
  • Milan Vojnovic, Microsoft Research
  • Gerhard Weikum, Max Planck Institute for Informatics
  • Ken Woodberry, Microsoft Research
  • Reynold Xin, University of California, Berkeley
  • Ke Yi, Hong Kong University of Science and Technology
  • Eiko Yoneki, University of Cambridge
  • Jingren Zhou, Microsoft Corporation (Bing)

Demos and Posters

Big Metadata - Programming at Internet-Scale

Don Syme, Kenji Takeda, and Keith Battocchi (Microsoft Research)

We live in an information society, yet programming languages are not designed for it. As we move into a world of “Devices + Services”, it is vital that developers can productively integrate information at internet scale into their everyday programming environment. Effectively exploring, navigating, understanding, analysing and presenting data – both big and broad – is key to success for developers, yet current experiences for this are cumbersome and labour-intensive. In this new demo we explain the latest research behind F# Type Providers and show how they can bring internet-scale metadata to developers’ fingertips. IntelliSense for internet-scale metadata creates a completely new way of rapidly developing applications on Azure, when using Hadoop, and when programming against the web. We explain how this can transform F# into a uniquely powerful and developer-friendly data-science scripting language for Azure.

Incremental, Iterative and Interactive Computation in Naiad

Frank McSherry, Derek Murray, Rebecca Isaacs, and Michael Isard (Microsoft Research)

We are developing a system for scalable data analysis designed to support iterative queries over continuously changing inputs at interactive timescales. The system is based on a new computational framework, differential dataflow, that generalizes standard incremental dataflow for far greater reuse of previous results when collections change. Differential dataflow distinguishes between the multiple reasons a collection might change, including both loop feedback and new input data, allowing a system to re-use the most appropriate results from previously performed work when an incremental update arrives.

Our prototype system, Naiad, demonstrates the practical application of differential dataflow. Like many systems, Naiad supports high-level declarative queries, data-parallel execution, and transparent distribution. Unlike other systems, Naiad efficiently executes queries with multiple (possibly nested) loops, while simultaneously responding with low latency to incremental changes to the inputs. We show how differential dataflow enables orders of magnitude speedups for a variety of complex workloads on real streaming data.
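
To make the difference-based idea concrete, here is a minimal sketch of an incrementally maintained count-per-key operator. This is our own illustration, not Naiad's implementation: collections are modelled as multisets, an update is a map from records to signed multiplicity changes, and the operator emits only the changes to its output.

```python
from collections import Counter

def apply_diff(collection, diff):
    """Apply a difference (record -> signed multiplicity change) to a multiset."""
    for rec, delta in diff.items():
        collection[rec] += delta
        if collection[rec] == 0:
            del collection[rec]

def incremental_count(state, diff):
    """Count-per-key operator: given a difference over an input collection of
    (key, value) records, update the per-key counts in `state` and return the
    difference over the output collection of (key, count) records."""
    out = Counter()
    for (key, _value), delta in diff.items():
        before = state[key]
        after = before + delta
        if before > 0:
            out[(key, before)] -= 1      # retract the old output record
        if after > 0:
            out[(key, after)] += 1       # assert the new one
        state[key] = after
    return Counter({k: v for k, v in out.items() if v != 0})
```

The work done per update is proportional to the size of the difference, not the size of the collection, which is the property the differential approach generalizes to loop feedback as well as new input.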

Dynamic Task Allocation in Asynchronous Shared Memory

Dan Alistarh (MIT), James Aspnes (Yale University), Michael A. Bender (Stony Brook University), Rati Gelashvili (MIT), and Seth Gilbert (NUS)

Task allocation is a classic distributed problem in which a set of p potentially faulty processes must perform a set of m tasks. The problem is all the more challenging when there is heterogeneity, either on the workers’ side, since individual agents may have varying speed and robustness, or because of uneven workloads. In large-scale systems, heterogeneity is the norm. Our work considers the dynamic version of task allocation, in which tasks are injected adversarially during an asynchronous execution. Intuitively, a major challenge in this setting is the fact that, at the same time, the adversary controls scheduling, process crashes, and the input.

We give a new shared-memory algorithm for dynamic task allocation which is within logarithmic factors of optimal. The main algorithmic idea is a randomized data structure called a dynamic to-do tree, which allows processes to pick new tasks to perform uniformly at random from the set of available tasks, and to insert tasks at uniform random locations in the data structure. We show that these properties avoid duplicating work unnecessarily in well-behaved executions, and that work duplication is inherent under certain input-schedule combinations. This is the first algorithm to allow efficient work sharing in this challenging setting, and the result has applications to other problems, such as producer-consumer buffers, and distributed renaming.
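
As a rough illustration of the core idea, the sketch below maintains subtree task counts in a fixed-depth binary tree, so that inserts land at uniformly random leaves and picks descend from the root weighted by the counts. This is a sequential caricature under our own naming: the published data structure is concurrent and all synchronization is omitted here.

```python
import random

class ToDoTree:
    """Sequential sketch of a dynamic to-do tree: a fixed-depth binary tree
    whose nodes hold subtree task counts (synchronization omitted)."""

    def __init__(self, depth):
        self.depth = depth
        size = 2 ** (depth + 1)
        self.count = [0] * size                  # 1-indexed heap layout
        self.tasks = [[] for _ in range(size)]   # task lists at the leaves

    def insert(self, task):
        node = 1
        for _ in range(self.depth):              # walk to a random leaf
            node = 2 * node + random.randint(0, 1)
        self.tasks[node].append(task)
        while node:                              # bump counts up to the root
            self.count[node] += 1
            node //= 2

    def pick(self):
        if self.count[1] == 0:
            return None                          # no tasks left
        node = 1
        for _ in range(self.depth):              # descend, weighted by counts
            left = 2 * node
            if random.randrange(self.count[node]) < self.count[left]:
                node = left
            else:
                node = left + 1
        task = self.tasks[node].pop()
        while node:                              # decrement counts to the root
            self.count[node] -= 1
            node //= 2
        return task
```

Because each descent step chooses a child with probability proportional to its count, a pick reaches any non-empty leaf with probability proportional to the tasks stored there, which is what spreads concurrent workers across the available work.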

RASP: Large-Scale Graph Traversal with SSD Prefetching

Eiko Yoneki (University of Cambridge), Karthik Nilakant (University of Cambridge), Valentin Dalibard (University of Cambridge), and Amitabha Roy (EPFL)

Mining large graphs has become an important aspect of many diverse applications, and a number of computer systems have been proposed to execute graph algorithms efficiently. Recent interest in this area has led to the construction of single-machine graph computation systems that use solid-state drives (SSDs) to store the graph. This approach reduces cost and simplifies the implementation of graph algorithms, making computation on large graphs available to the average user. However, SSDs are slower than main memory, and making full use of their bandwidth is crucial for executing graph algorithms in a reasonable amount of time. We present RASP (the Run-Ahead SSD Prefetcher) for graph algorithms, which parallelises requests to derive maximum throughput from SSDs. RASP combines a judicious distribution of graph state between main memory and SSDs with an innovative run-ahead algorithm to prefetch needed data in parallel. This is in contrast to existing approaches that depend on multi-threading the graph algorithms to saturate available bandwidth. Our experiments on graph algorithms using random access show that RASP not only is capable of maximising the throughput from SSDs but is also able to almost hide the effect of I/O latency. The improvement in runtime for graph algorithms is up to 14x when compared to a single-threaded baseline. When compared to sophisticated multi-threaded implementations, RASP performs up to 80% faster without the program complexity and programmer effort needed for multi-threaded graph algorithms.
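
The run-ahead idea can be caricatured in a few lines. The sketch below is our own synchronous simulation, not the authors' code: a real prefetcher would issue asynchronous SSD reads, which here are stood in for by warming a cache a few frontier vertices ahead of the BFS consumer.

```python
from collections import deque

def bfs_runahead(fetch, start, runahead=8):
    """BFS where adjacency lists for the next `runahead` frontier vertices
    are requested ahead of use.  `fetch(v)` stands in for an SSD read of
    v's adjacency list; `cache` stands in for prefetched pages in RAM."""
    cache = {}

    def load(v):
        if v not in cache:
            cache[v] = fetch(v)          # "issue" the read (synchronously here)
        return cache[v]

    visited = {start}
    frontier = deque([start])
    order = []
    while frontier:
        for v in list(frontier)[:runahead]:
            load(v)                      # run ahead of the consumer
        v = frontier.popleft()
        order.append(v)
        for w in load(v):                # already warm by the time it's needed
            if w not in visited:
                visited.add(w)
                frontier.append(w)
    return order
```

The point of the run-ahead distance is that, with asynchronous reads, several SSD requests are in flight at once while the single-threaded algorithm consumes the frontier in order.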

Active Data Management for Distributed Graph Processing

Karthik Nilakant (University of Cambridge) and Eiko Yoneki (University of Cambridge)

Graph processing systems are subject to two conflicting concerns: first, a large number of algorithms for processing graphs exist, most of which exhibit poor locality of access; second, the increasing scale of data has resulted in a need to distribute the data across multiple machines, further affecting locality. Most existing systems tend to consider these factors separately. We propose schemes for pro-active or reactive graph management. Pro-actively, one could instrument algorithmic code fragments to allow the processing engine to predict the future behaviour of the graph algorithm. Producing “balanced cuts” on a graph is computationally expensive, and infeasible for large datasets. Instead, lightweight graph analytics could be deployed to re-arrange the graph into less tightly coupled clusters. We propose a framework for considering these factors in unison. Ultimately, the choice of method to use in a given scenario will depend on the structure of the dataset and the processing algorithm. Our focus will be on identifying the minimum characteristics that must be gleaned from the code or data to be processed (or alternatively provided by the user) in order to boost the performance of such a system.

Infrastructure for real-time analytics on modern cluster networks

Aleksandar Dragojevic, Dushyanth Narayanan, Orion Hodson, Miguel Castro (Microsoft Research)

Two hardware trends present a great opportunity for building efficient infrastructure for real-time analytics: (1) servers today have tens to hundreds of gigabytes of RAM, making it possible to keep large data sets in the RAM of a moderately sized cluster of servers, and (2) modern cluster networks, such as InfiniBand and RoCE, support access to the memory of remote servers with just a few microseconds of latency. By keeping the whole data set in the main memory of a cluster and using the remote direct memory access (RDMA) primitives of the network, it is possible to support tens to hundreds of millions of point lookups in a terabyte-scale key-value store on a cluster of just a few tens of servers.

We present FaRM, a platform for building in-memory applications that efficiently use modern cluster networks. FaRM exposes the memory of all servers in a cluster as a single, fault-tolerant shared memory; it supports atomic reads and writes of small user-defined objects, provides efficient short transactions for easier synchronization, and exposes the location of objects to support locality-aware optimizations. We describe how to use FaRM to build efficient and scalable B-trees and hashtables, which can be used as key-value stores or as building blocks for more complex systems.

Balanced Edge-Partitioning of Massive Scale Graphs

Florian Bourse (Ecole Normale Supérieure, Paris), Marc Lelarge (INRIA- Ecole Normale Supérieure, Paris), and Milan Vojnovic (Microsoft Research)

Many computation tasks are performed on large scale graph data which may amount to as many as billions or even trillions of vertices or edges (e.g. online social network graphs, knowledge graphs, and web graphs). A standard way to scale up such computations is to partition an input graph into clusters of balanced sizes and small cut costs. Graph partitioning is of interest for a wide range of systems including iterative computation platforms, distributed graph databases, and stream processing platforms. Besides its practical importance, graph partitioning is one of the most fundamental and challenging theoretical computer science problems.

In our work we consider novel graph partitioning problems that arise under some computation models of interest, which specify how the size of a cluster and the cost of a cut are defined. We devote special focus to online assignment algorithms that require a single pass through the vertices or edges. Our preliminary performance evaluation results demonstrate significant benefits of using graph partitioning algorithms that are specifically tailored to a given computation model.

This is a collaboration project with the MSR-INRIA joint research centre.
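
For a flavour of what a single-pass assignment rule looks like, here is a hedged sketch of a simple greedy edge-partitioning heuristic. It is our illustration in the spirit of the abstract, not the authors' algorithm: each arriving edge goes to a cluster that already replicates both endpoints if one exists, else one replicating either endpoint, else the least-loaded cluster.

```python
def greedy_edge_partition(edges, k):
    """One-pass greedy edge partitioning into k clusters.  Assigns each edge
    to a cluster already replicating both endpoints, else either endpoint,
    else any cluster; ties go to the least-loaded (lowest-index) cluster."""
    load = [0] * k                       # edges assigned to each cluster
    replicas = {}                        # vertex -> clusters holding a copy
    assignment = []
    for u, v in edges:
        ru = replicas.setdefault(u, set())
        rv = replicas.setdefault(v, set())
        candidates = (ru & rv) or (ru | rv) or set(range(k))
        c = min(sorted(candidates), key=lambda i: load[i])
        load[c] += 1                     # cluster size = number of edges here
        ru.add(c)
        rv.add(c)
        assignment.append(c)
    return assignment, load
```

Preferring clusters that already hold the endpoints keeps the vertex replication factor (the cut cost under edge partitioning) low, while the load tie-break keeps cluster sizes balanced; different computation models would change both the size and cut definitions used here.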

Mining Interesting Patterns in Multi-Relational Data

Eirini Spyropoulou (University of Bristol) and Tijl De Bie (University of Bristol)

Research on local pattern mining algorithms has for years focused on mining one table of transactions and items (market basket data). However, a large number of real-world datasets are multi-relational, i.e., they contain more than two entities and more than one relationship. There is therefore increasing interest in developing techniques for mining such complex types of data. Defining a multi-relational pattern syntax that is suitably expressive, and finding an efficient mining algorithm for it, is a challenging problem. Furthermore, since local pattern mining methods usually suffer from a combinatorial explosion of patterns in the output, assessing the quality of patterns is also very important in order to make them useful to the end user.

In this work we introduce a novel approach to mining patterns in multi-relational data. We propose Complete Connected Subsets, a new syntax for multi-relational patterns. While this pattern syntax is generally applicable to multi-relational data, it reduces to the well-known itemset syntax when the data is a single transaction table. We also introduce RMiner, a simple yet practically efficient algorithm to mine such patterns. We show how the interestingness of patterns of the proposed syntax can be conveniently quantified using a framework that measures the subjective interestingness of patterns according to the user's prior knowledge. Finally, we present results on real-world data sets.