ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning

Diandian Gu; Yihao Zhao; Yinmin Zhong; Yifan Xiong; Zhenhua Han; Peng Cheng; Fan Yang; Gang Huang; Xin Jin; Xuanzhe Liu

ASPLOS 2023 | March 2023

This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep learning. ElasticFlow provides a serverless interface with two distinct features: (i) users specify only the DNN model and hyperparameters for a job, but not the number of GPUs; (ii) users specify the deadline for a job, but not the amount of time to occupy GPUs. In contrast to existing server-centric platforms, ElasticFlow provides performance guarantees in terms of meeting deadlines and relieves users of tedious, low-level, manual resource management.
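To make the shape of such a serverless interface concrete, the following is a minimal sketch of what a job request could contain. The class name `TrainingJobRequest` and all of its field names are hypothetical, chosen for illustration only; they are not ElasticFlow's actual API. The point is what is absent: there is no GPU count and no reservation length.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TrainingJobRequest:
    """Hypothetical job request: the user states what to train and by when."""

    model: str                          # identifier of the DNN model to train
    hyperparameters: Dict[str, float]   # e.g. batch size, learning rate
    deadline_s: float                   # seconds from submission to the deadline
    # Deliberately no `num_gpus` or `gpu_hours` field: under a serverless
    # interface the platform, not the user, decides resource allocation.

    def __post_init__(self) -> None:
        # A deadline in the past (or at submission time) can never be met.
        if self.deadline_s <= 0:
            raise ValueError("deadline_s must be positive")


request = TrainingJobRequest(
    model="resnet50",
    hyperparameters={"batch_size": 256, "learning_rate": 0.1},
    deadline_s=3600.0,
)
```

In a server-centric platform the same request would typically also carry a GPU count and a reservation period, which is the manual resource management the abstract refers to.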

The characteristics of distributed training introduce two challenges. First, the training throughput scales non-linearly with the number of GPUs. Second, the scaling efficiency is affected by worker placement. To address these challenges, we propose Minimum Satisfactory Share to capture the resource usage of training jobs to meet deadlines, and ElasticFlow performs admission control based on it. We develop a greedy algorithm that dynamically allocates resources to admitted jobs based on diminishing returns. We apply buddy allocation to worker placement to eliminate the effect of topology. Evaluation results on a cluster of 128 GPUs show that ElasticFlow increases the number of jobs that can meet their deadlines by 1.45–7.65× compared to existing solutions.
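The ideas in this paragraph can be illustrated with a simplified sketch. It is not the paper's algorithm: it assumes that every job runs from now until its own deadline, that throughput is known from a profiled table keyed by power-of-two GPU counts, and that "diminishing returns" is measured as throughput gained per additional GPU. The function names, the table format, and the sample numbers are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical profile: GPU count (power of two) -> iterations per second.
Throughput = Dict[int, float]
# Hypothetical job description: (remaining iterations, seconds to deadline, profile).
Job = Tuple[float, float, Throughput]


def minimum_satisfactory_share(
    remaining_iters: float, time_to_deadline: float, throughput: Throughput
) -> Optional[int]:
    """Smallest profiled GPU count that finishes the remaining work in time."""
    for gpus in sorted(throughput):
        if throughput[gpus] * time_to_deadline >= remaining_iters:
            return gpus
    return None  # the deadline cannot be met at any profiled scale


def allocate(jobs: Dict[str, Job], total_gpus: int) -> Dict[str, int]:
    """Admit jobs in arrival order, then hand out leftover GPUs greedily."""
    shares: Dict[str, int] = {}
    for name, (iters, ttd, profile) in jobs.items():
        share = minimum_satisfactory_share(iters, ttd, profile)
        # Admission control: admit only if the minimum share still fits.
        if share is not None and sum(shares.values()) + share <= total_gpus:
            shares[name] = share

    free = total_gpus - sum(shares.values())
    while free > 0:
        best_name, best_gain = None, 0.0
        for name, current in shares.items():
            profile = jobs[name][2]
            doubled = current * 2  # keep power-of-two sizes, as buddy allocation needs
            if doubled in profile and current <= free:
                # Marginal throughput per extra GPU; shrinks as the job scales out.
                gain = (profile[doubled] - profile[current]) / current
                if gain > best_gain:
                    best_name, best_gain = name, gain
        if best_name is None:
            break  # no job can use the remaining GPUs
        free -= shares[best_name]
        shares[best_name] *= 2
    return shares


jobs: Dict[str, Job] = {
    "a": (1000.0, 100.0, {1: 8.0, 2: 15.0, 4: 26.0, 8: 40.0}),
    "b": (3000.0, 100.0, {1: 10.0, 2: 19.0, 4: 34.0, 8: 50.0}),
}
print(allocate(jobs, total_gpus=8))
```

In this toy input, job `a` needs at least 2 GPUs and job `b` at least 4 to meet their deadlines. On an 8-GPU cluster both are admitted, and the 2 leftover GPUs go to `a`, since `b` would need 4 more to double. Because the sample profiles scale sub-linearly, the per-GPU gain falls each time a job doubles, which is what lets the greedy step compare jobs against each other.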