Title: DataHub: A Hosted Platform for Organizing, Managing, Sharing, Collaborating, and Processing Data
In this talk, I will describe DataHub, a hosted data collaboration platform we are building at MIT. DataHub is a unified data management and collaboration platform with two parts: (a) a flexible data store (files and relational databases, extensible to other data-storage backends) with sharing/collaboration, security, and other data management capabilities, managed on behalf of different users and groups, and (b) an app ecosystem that hosts apps for data-processing activities such as ingestion, curation, integration, discovery, query, analytics, visualization, and machine learning. A new application can be written and published to the DataHub App Center using our SDK (Thrift-based APIs, which can be compiled into any of the 20+ Thrift-supported languages). DataHub users can use any app from the App Center to process their data as their needs require. Essentially, DataHub is a unified, managed, collaborative platform for making data processing easy. I will also discuss some useful data-processing apps we are building for the DataHub platform: (a) Distill, a general-purpose, example-based data cleaning/extraction tool for converting semi-structured text into a structured table; (b) DViz, a simple drag-and-drop interface for creating visualizations; and (c) DataHub Notebook, an IPython extension that enables sophisticated data science directly inside DataHub.
Title: E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems
On-line transaction processing (OLTP) database management systems (DBMSs) often serve time-varying workloads due to daily, weekly or seasonal fluctuations in demand, or because of rapid growth in demand due to a company’s business success. In addition, many OLTP workloads are heavily skewed to “hot” tuples or ranges of tuples. For example, the majority of NYSE volume involves only 40 stocks. To deal with such fluctuations, an OLTP DBMS needs to be elastic; that is, it must be able to expand and contract resources in response to load fluctuations and dynamically balance load as hot tuples vary over time. In this talk I will present E-Store, an elastic partitioning framework for distributed OLTP DBMSs. It automatically scales resources in response to demand spikes, periodic events, and gradual changes in an application’s workload. E-Store addresses localized bottlenecks through a two-tier data placement strategy: cold data is distributed in large chunks, while smaller ranges of hot tuples are assigned explicitly to individual nodes. This is in contrast to traditional single-tier hash and range partitioning strategies. Our experimental evaluation of E-Store shows the viability of our approach and its efficacy under variations in load across a cluster of machines. Compared to single-tier approaches, E-Store improves throughput by up to 130% while reducing latency by 80%.
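The two-tier placement idea described above can be illustrated with a small sketch. This is not E-Store's actual algorithm or API: the class name `TwoTierPlacement`, the `block_size` parameter, the round-robin mapping of cold blocks, and the greedy least-loaded rule in `rebalance_hot` are all assumptions chosen only to show how an explicit hot-tuple map can sit on top of a coarse-grained cold placement.

```python
from dataclasses import dataclass, field


@dataclass
class TwoTierPlacement:
    """Hypothetical two-tier lookup: explicit hot-tuple map over coarse cold blocks."""
    num_nodes: int
    block_size: int = 1000                        # keys per cold block (assumed)
    hot_map: dict = field(default_factory=dict)   # tier 1: key -> node

    def cold_node(self, key: int) -> int:
        # Tier 2: keys are grouped into large contiguous blocks, and the
        # blocks are spread round-robin across nodes (an assumed policy).
        return (key // self.block_size) % self.num_nodes

    def node_for(self, key: int) -> int:
        # Tier 1 takes precedence: a pinned hot tuple overrides its block.
        return self.hot_map.get(key, self.cold_node(key))

    def pin_hot(self, key: int, node: int) -> None:
        if not 0 <= node < self.num_nodes:
            raise ValueError("unknown node")
        self.hot_map[key] = node


def rebalance_hot(placement, access_counts, top_k):
    """Pin the top_k most accessed keys, each to the currently least loaded node.

    Returns the estimated per-node load after placement. The greedy rule is an
    illustrative stand-in, not the planner described in the talk.
    """
    hottest = sorted(access_counts, key=access_counts.get, reverse=True)[:top_k]
    hot = set(hottest)

    # Start from the load contributed by tuples that stay in cold blocks.
    load = [0] * placement.num_nodes
    for key, count in access_counts.items():
        if key not in hot:
            load[placement.cold_node(key)] += count

    placement.hot_map.clear()
    for key in hottest:
        target = min(range(placement.num_nodes), key=load.__getitem__)
        placement.pin_hot(key, target)
        load[target] += access_counts[key]
    return load


placement = TwoTierPlacement(num_nodes=3, block_size=10)
counts = {1: 100, 2: 90, 3: 80, 25: 5, 40: 5}   # made-up access counts
final_load = rebalance_hot(placement, counts, top_k=3)
```

In this toy run, keys 1, 2, and 3 all fall in the same cold block and would land on one node under single-tier placement; pinning them individually spreads them across the three nodes, while keys 25 and 40 stay with their blocks.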