Big Data Platforms

July 15, 2013
Ion Stoica, Raghu Ramakrishnan, and Vivek Narasayya | Microsoft, University of California, eXtreme Computing Group
Faculty Summit 2013

“Big Data” holds the promise of transforming how enterprises and consumers alike work with data. In this session of the 2013 Microsoft Research Faculty Summit, leaders of our community discuss how data platforms must evolve to realize the potential of Big Data. The first talk in the session outlines how Microsoft is reimagining its data platform strategy, taking into account not only the need to support the characteristics of Big Data but also the importance of the cloud as the delivery mechanism. The second talk describes an architecture that pulls together the core system components necessary to support data analytics at scale. The final talk tackles the challenge of performance isolation that arises when multiple data analytics tenants share the resources in a cloud platform.

Big Data at Microsoft
Speaker: Raghu Ramakrishnan, Microsoft
We are seeing arguably the most sweeping changes in data management since the relational database revolution of the ’70s and ’80s. “Big Data,” pushing the limits of data volumes, variety of data types and analyses, and real-time response, has become mainstream. Cloud as a delivery channel has been growing simultaneously. These twin shifts require significant changes in how we approach the next generation of data management platforms. In this talk, I will discuss some of the directions we are exploring at Microsoft in STB’s Data Platforms Group.
Taming Big Data with Spark and Berkeley Data Analytics Stack (BDAS)
Speaker: Ion Stoica, University of California, Berkeley
Today’s data analytics tools are slow in answering even simple queries, as they typically require sifting through huge amounts of data stored on disk, and are even less suitable for complex computations, such as machine learning algorithms. To address these challenges, for the past four years we have been developing BDAS, an open source data analytics stack. At the core of BDAS is Spark, an in-memory parallel execution engine, which enables us to provide unified support for batch, streaming, and interactive computations, as well as support for sophisticated graph-based and machine learning algorithms. Today, Spark and other BDAS components are used in production by tens of companies and institutions. In this talk, I’ll present the architecture and the main design decisions we made in Spark, as well as our future plans.
Performance Isolation in Multi-Tenant Cloud Data Services
Speaker: Vivek Narasayya, Microsoft Research
Multi-tenancy is essential to increase utilization and reduce operational cost in database-as-a-service platforms such as Microsoft SQL Azure, and “Big Data” platforms such as Hadoop and Cosmos. However, contention for shared resources in a multi-tenant system can result in one tenant’s performance being adversely affected by the workloads of other tenants. Assurances on performance isolation can significantly increase the service quality and the tenants’ experience. Our approach to providing performance isolation is to enable an abstraction of reservation of key system resources (CPU, I/O, memory, etc.) critical to a tenant’s workload. The major challenges lie in supporting this abstraction without statically allocating resources, and in devising techniques for objectively establishing the service provider’s accountability. In this talk, we outline the key ideas of our approach and show demonstrations of performance isolation in relational database-as-a-service and in Hadoop.