Resource-Efficient Redundancy for Large-Scale Data Processing and Storage Systems

Large-scale systems are often subject to non-ideal conditions such as failures, stragglers, and load imbalance. These conditions degrade query latency in data-processing systems, and durability and access latency in storage systems. Redundancy (duplication of data and/or queries) is a common way to impart resilience against such effects. In this talk, I will present two sets of results that take fundamentally new approaches to adding redundancy in data-processing and storage systems, blending tools from coding theory and machine learning with systems insights:

(1) A novel learning-and-coding-based resilient computation framework and its application to reducing tail latency in serving neural network models for a variety of tasks such as image classification, speech recognition, and object detection. Our solution is the first to overcome a challenging barrier that restricted existing coding-based resilient computation approaches to a severely limited class of functions.

(2) A new redundancy-configuration approach for large-scale storage systems that exploits reliability heterogeneity in storage devices to achieve significant cost savings. Our solution challenges the widely used static approach to configuring redundancy by proposing a dynamic, data-driven approach that tailors redundancy levels to observed failure rates. Using a production data set, we show an 11-16% reduction in storage space even in highly optimized erasure-coded storage systems, translating to significant cost savings in large-scale operations.


Speaker Details

Rashmi K. Vinayak is an assistant professor in the Computer Science Department at Carnegie Mellon University. She received her PhD from the EECS department at UC Berkeley in 2016, and was a postdoctoral researcher at AMPLab/RISELab and BLISS at UC Berkeley. Her dissertation received the Eli Jury Award 2016 from the EECS department at UC Berkeley for “outstanding achievement in the area of systems, communications, control, or signal processing”.

She is a recipient of the Facebook Communications and Networking Research Award 2018, the Google Faculty Research Award 2018, and the IEEE Data Storage Best Paper and Best Student Paper Awards for the years 2011/2012. She was also a recipient of the Facebook Fellowship 2012-13, the Microsoft Research PhD Fellowship 2013-15, and the Google Anita Borg Memorial Scholarship 2015-16.

Her research interests lie broadly in the areas of computer and networked systems, and information and coding theory. Her current research focuses on addressing reliability, availability, scalability, and performance challenges in data storage and caching systems, in systems for machine learning, and in live video streaming, building on theoretical foundations.

Date:
Speakers: Rashmi Vinayak
Affiliation: Carnegie Mellon University

Series: Microsoft Research Talks