DataHub: A Hosted Platform for Organizing, Managing, Sharing, Collaborating, and Processing Data; E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems

Date

January 21, 2015

Speaker

Rebecca Yale Taft and Anant Bhardwaj

Affiliation

MIT

Overview

Title: DataHub: A Hosted Platform for Organizing, Managing, Sharing, Collaborating, and Processing Data

In this talk, I will describe DataHub, a hosted, unified data management and collaboration platform we are building at MIT. The platform consists of (a) a flexible data store (files and relational databases, extensible to other storage backends) with sharing, collaboration, security, and other data management capabilities, managed on behalf of different users and groups, and (b) an app ecosystem that hosts apps for data-processing activities such as ingestion, curation, integration, discovery, query, analytics, visualization, and machine learning. A new application can be written and published to the DataHub App Center using our SDK, whose Thrift-based APIs can be compiled into any of the 20+ Thrift-supported languages. DataHub users can use any app from the App Center to process their data as their needs require. Essentially, DataHub is a unified, managed, collaborative platform for making data processing easy. I will also discuss some useful data-processing apps we are building for the DataHub platform: (a) Distill, a general-purpose, example-based data cleaning and extraction tool for converting semi-structured text into a structured table; (b) DViz, a simple drag-and-drop interface for creating visualizations; and (c) DataHub Notebook, an IPython extension that enables sophisticated data science directly inside DataHub.

Title: E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems

On-line transaction processing (OLTP) database management systems (DBMSs) often serve time-varying workloads due to daily, weekly or seasonal fluctuations in demand, or because of rapid growth in demand due to a company’s business success. In addition, many OLTP workloads are heavily skewed to “hot” tuples or ranges of tuples. For example, the majority of NYSE volume involves only 40 stocks. To deal with such fluctuations, an OLTP DBMS needs to be elastic; that is, it must be able to expand and contract resources in response to load fluctuations and dynamically balance load as hot tuples vary over time. In this talk I will present E-Store, an elastic partitioning framework for distributed OLTP DBMSs. It automatically scales resources in response to demand spikes, periodic events, and gradual changes in an application’s workload. E-Store addresses localized bottlenecks through a two-tier data placement strategy: cold data is distributed in large chunks, while smaller ranges of hot tuples are assigned explicitly to individual nodes. This is in contrast to traditional single-tier hash and range partitioning strategies. Our experimental evaluation of E-Store shows the viability of our approach and its efficacy under variations in load across a cluster of machines. Compared to single-tier approaches, E-Store improves throughput by up to 130% while reducing latency by 80%.
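The two-tier placement idea in the abstract can be illustrated with a small lookup sketch. This is not E-Store's implementation: the class name `TwoTierPlacement`, its methods, the fixed-size blocks, the round-robin block ownership, and the use of single integer keys (rather than ranges of hot tuples) are all assumptions made for illustration only.

```python
class TwoTierPlacement:
    """Hypothetical sketch: explicit hot-key overrides on top of coarse cold blocks."""

    def __init__(self, block_size, nodes):
        # Cold tier (assumption): the integer key space is cut into fixed-size
        # blocks, and blocks are assigned to nodes round-robin.
        self.block_size = block_size
        self.nodes = list(nodes)
        # Hot tier (assumption): individual keys pinned explicitly to one node.
        self.hot = {}

    def pin_hot(self, key, node):
        """Assign a hot key to a specific node, overriding its cold block."""
        self.hot[key] = node

    def unpin(self, key):
        """Return a key to cold placement once it is no longer hot."""
        self.hot.pop(key, None)

    def node_for(self, key):
        """Route a key: check the hot tier first, then fall back to the cold block."""
        if key in self.hot:
            return self.hot[key]
        block = key // self.block_size
        return self.nodes[block % len(self.nodes)]


placement = TwoTierPlacement(block_size=1000, nodes=["n0", "n1", "n2"])
placement.pin_hot(42, "n2")        # move one hot tuple off its block's node
print(placement.node_for(42))      # hot tier answers for the pinned key
print(placement.node_for(43))      # a neighbor in the same block stays cold
```

In this sketch, moving a hot tuple only touches the small override map, while the bulk of the data keeps its coarse block assignment. A single-tier range or hash scheme would instead have to move the whole block or partition that contains the hot tuple.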

Speakers

Rebecca Yale Taft and Anant Bhardwaj

Rebecca Yale Taft is a third-year PhD student in the Database Group at MIT, working with Mike Stonebraker. She graduated from Yale University in 2008 with a B.S. in Physics. Before coming to MIT, she spent two years working as a consultant at Exeter Group and another two years working as a software engineer at Bloomberg LP. Her current research interests include elastic scalability of DBMSs as well as multi-tenancy in the cloud.

Anant Bhardwaj is a Ph.D. student in the Computer Science & Artificial Intelligence Laboratory (CSAIL) at MIT, co-advised by David Karger and Samuel Madden. His primary interest these days is in developing systems and tools for data management. His research projects draw ideas from various fields such as databases, distributed systems, algorithms, machine learning, and human-computer interaction. His current projects are: 1) DataHub: a hosted platform for data management, 2) Distill: a general-purpose, example-based data cleaning and extraction tool for converting semi-structured text into a structured table, 3) Barista: a distributed, synchronously replicated, fault-tolerant relational data store, and 4) Confer: a tool for conference planning (deployed at 13 academic conferences, including CHI, CSCW, KDD, ACM MM, SIGMOD, SIGIR, and WSDM, with more than 18,000 unique users). He received an M.S. in Computer Science from Stanford University and a B.E. in Computer Engineering from the University of Pune. At Stanford, he worked in the Human-Computer Interaction (HCI) group with Scott Klemmer and Jeff Heer.