Towards Domain-Specific Network Transport for Distributed DNN Training
- Hao Wang,
- Han Tian,
- Jingrong Chen,
- Xinchen Wan,
- Jiacheng Xia,
- Gaoxiong Zeng,
- Wei Bai,
- Junchen Jiang,
- Yong Wang,
- Kai Chen
NSDI'24
The nature of machine learning (ML) applications exposes rich characteristics to the underlying network transport, yet little work has been done so far to systematically exploit these properties in transport-layer design. This paper takes the initiative to pursue a domain-specific network transport, called MLT, for distributed DNN training that fully embraces several unique characteristics of machine learning.
At its heart, MLT employs three simple-yet-effective techniques to form a three-step progressive scheme against the long tail latency caused by transient packet drops and queueing. First, it leverages the independence among gradient updates to enable per-packet load balancing, minimizing network hotspots without worrying about packet re-ordering. Then, if a hotspot arises, it performs priority queueing/dropping by differentiating gradients based on their layers and magnitudes, so as to optimize model convergence and accuracy. Lastly, if drops do occur, it enables bounded-loss tolerance: a bounded amount of gradient loss can be tolerated by DNN training without affecting the final model performance.
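The sketch below illustrates the flavor of the latter two steps, assuming a fixed number of switch priority queues and a small loss budget; the helper names, the layer/magnitude weighting, and the 1% budget are illustrative assumptions rather than the paper's exact design.

```python
# Illustrative sketch only: assign_priority and within_loss_budget are
# hypothetical helpers, not the MLT artifact's actual interface.
import numpy as np

NUM_PRIORITIES = 8  # assumed number of priority queues on a commodity switch


def assign_priority(layer_idx: int, num_layers: int, grad_chunk: np.ndarray) -> int:
    """Map a gradient chunk to a switch priority level (0 = lowest).

    Gradients are differentiated by the layer they belong to and by their
    magnitude; the 50/50 weighting below is an assumed heuristic.
    """
    layer_score = 1.0 - layer_idx / max(num_layers - 1, 1)
    magnitude_score = min(float(np.mean(np.abs(grad_chunk))), 1.0)
    score = 0.5 * layer_score + 0.5 * magnitude_score
    return int(score * (NUM_PRIORITIES - 1))


def within_loss_budget(received_pkts: int, expected_pkts: int, budget: float = 0.01) -> bool:
    """Bounded-loss tolerance: accept an aggregation round if the fraction of
    lost gradient packets stays within a small budget (1% assumed here),
    instead of triggering retransmission."""
    lost_fraction = 1.0 - received_pkts / expected_pkts
    return lost_fraction <= budget
```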
MLT is readily deployable with commodity switches and requires only minimal modifications to popular DNN training libraries (e.g., TensorFlow, MXNet, and PyTorch) and communication routines (e.g., Parameter Server (PS) and Ring All-reduce). We show, via both testbed experiments and simulations, that MLT effectively reduces network tail latency and improves end-to-end training time by up to 62.2% over prior work.
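As a rough indication of what minimal library-side modifications could look like, the sketch below uses PyTorch's standard per-parameter gradient hooks to tag each gradient with a priority before handing it to a priority-aware transport; the mlt_send placeholder and the magnitude-based priority heuristic are assumptions, not the paper's actual API.

```python
# Illustrative sketch only: mlt_send is a hypothetical hand-off to a
# priority-aware transport; register_hook is standard PyTorch API.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))


def make_hook(layer_idx: int):
    def hook(grad: torch.Tensor) -> torch.Tensor:
        # Toy heuristic: larger-magnitude gradients get a higher priority (0-7).
        priority = min(int(grad.abs().mean().item() * 10), 7)
        # mlt_send(grad, layer=layer_idx, priority=priority)  # placeholder call
        return grad  # the gradient itself is left unchanged
    return hook


for idx, param in enumerate(model.parameters()):
    param.register_hook(make_hook(idx))

# One training step: the hooks fire as each parameter's gradient is computed.
loss = model(torch.randn(4, 784)).sum()
loss.backward()
```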