Machine Learning Systems for Highly Distributed and Rapidly Growing Data

  • Kevin Hsieh | Carnegie Mellon University

The usability and practicality of machine learning are largely influenced by two critical factors: low latency and low cost. However, achieving low latency and low cost is very challenging when machine learning depends on real-world data that are rapidly growing and highly distributed (e.g., training a face recognition model using pictures stored across many data centers globally).

In this talk, I will present my work on building low-latency and low-cost machine learning systems that enable efficient processing of real-world, large-scale data. I will describe a system-level approach that is inspired by the general characteristics of machine learning algorithms, machine learning model structures, and machine learning training/serving data. In line with this approach, I will first present a system that provides both low-latency and low-cost machine learning serving (inference) over large-scale, continuously growing datasets (e.g., videos). Shifting the focus to model training, I will then present a system that makes machine learning training over geo-distributed datasets as fast as training within a single data center. Finally, I will discuss our ongoing efforts to tackle a fundamental and largely overlooked problem: machine learning training over skewed data partitions (e.g., facial images collected by cameras in different countries).

Speaker Details

Kevin Hsieh is a PhD candidate at Carnegie Mellon University, where he works with Phil Gibbons and Onur Mutlu. His research interests span machine learning, systems, and computer architecture, with a recent focus on high-performance machine learning systems for real-world, large-scale data. His work has appeared in top systems and architecture venues such as OSDI, NSDI, ISCA, and MICRO. Before pursuing his PhD, he was an engineering manager at MediaTek in Taiwan, where he led the development of system architectures for mobile SoCs.