On Modular Learning of Distributed Systems for Predicting End-to-End Latency

NSDI (USENIX Symposium on Networked Systems Design and Implementation)

Published by USENIX

An emerging trend in cloud deployments is to adopt machine learning (ML) models to characterize end-to-end system performance and, in turn, to help non-system-experts tune system configurations. Despite early success, we observe that such methods can incur significant costs when adapting to the deployment dynamics of distributed systems, such as services scaling out or being replaced. They require hours or even days for data collection and model training; otherwise, the ML models may exhibit unacceptable inaccuracy. We argue that this problem arises from the practice of modeling the entire system with a monolithic model. To address the issue, we propose Fluxion, a framework that models end-to-end system latency with modularized learning. Fluxion introduces the learning assignment, a new abstraction for modeling the latency of a single sub-component rather than the whole system. Through a consistent interface, multiple heterogeneous learning assignments can be composed into an inference graph that models a complex distributed system on the fly. A change to a system component only requires updating the corresponding assignment, significantly reducing adaptation costs. Across three scenarios with up to 142 microservices on a 100-VM cluster, Fluxion achieves a performance-modeling MAE (mean absolute error) up to 68.41% lower than that of monolithic models, which in turn improves the 90th-percentile end-to-end latency by up to 1.57×. All of these gains are achieved under various system deployment dynamics.
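The composition idea in the abstract (per-component learning assignments wired into an inference graph, with component changes handled by swapping a single assignment) can be illustrated with a minimal Python sketch. All names below (LearningAssignment, InferenceGraph, predict_end_to_end, and the toy lambda "models") are hypothetical illustrations, not Fluxion's actual API, and the stand-in models take the place of trained ML regressors.

```python
# Minimal sketch, assuming a hypothetical API: each LearningAssignment wraps a
# per-component latency model behind a consistent interface, and an
# InferenceGraph composes them so each node's predicted latency becomes a
# feature ("<name>_latency") available to downstream nodes.

from typing import Callable, Dict, List


class LearningAssignment:
    """Wraps one sub-component's latency model behind a consistent interface."""

    def __init__(self, name: str,
                 model: Callable[[Dict[str, float]], float],
                 inputs: List[str]):
        self.name = name      # e.g. "cache", "db"
        self.model = model    # maps named features -> predicted latency (ms)
        self.inputs = inputs  # feature names this assignment consumes

    def predict(self, features: Dict[str, float]) -> float:
        return self.model({k: features[k] for k in self.inputs})


class InferenceGraph:
    """Composes assignments (assumed in topological order) into an
    end-to-end latency predictor."""

    def __init__(self):
        self.nodes: List[LearningAssignment] = []

    def add(self, assignment: LearningAssignment) -> None:
        self.nodes.append(assignment)

    def replace(self, name: str, assignment: LearningAssignment) -> None:
        # A component change only swaps its assignment; other nodes are untouched.
        self.nodes = [assignment if n.name == name else n for n in self.nodes]

    def predict_end_to_end(self, features: Dict[str, float]) -> float:
        feats = dict(features)
        latency = 0.0
        for node in self.nodes:
            latency = node.predict(feats)
            feats[f"{node.name}_latency"] = latency  # expose to downstream nodes
        return latency  # the last node's output is the end-to-end estimate


if __name__ == "__main__":
    # Toy linear "models" standing in for trained per-component regressors.
    cache = LearningAssignment("cache", lambda f: 0.2 * f["load"], ["load"])
    db = LearningAssignment("db",
                            lambda f: 1.5 * f["load"] + f["cache_latency"],
                            ["load", "cache_latency"])
    graph = InferenceGraph()
    graph.add(cache)
    graph.add(db)
    print(graph.predict_end_to_end({"load": 10.0}))  # prints 17.0 (ms)
```

Under this sketch, replacing or retraining the cache's model would touch only its assignment via `replace`, leaving the rest of the graph (and its trained models) intact, which is the source of the cost savings the abstract claims.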