The growing computational demand of training deep neural networks (DNNs) makes distributed training standard practice. Although existing training systems use multiple devices to achieve high degrees of data parallelism, large-scale distributed training cannot guarantee linear speedup. A major challenge faced by practitioners is that they cannot determine the precise efficiency of a task without deploying the model and profiling its performance on a cluster. However, deployment and profiling are tedious and costly. We address this problem by introducing Merak, a DAG-based simulator that faithfully replays
the training process and accurately predicts the step time. We draw attention to the communication operations
in distributed training and identify two critical problems in existing simulation work. (1) We propose a running time formulation for all-reduce kernels that captures the costs of data propagation and the reduce operation. (2)
We design and train an ML-based prediction model to capture the interference between computation kernels and
all-reduce kernels. We adopt the profile-and-predict approach to derive the step time of a large-scale distributed task from profiles of a small-scale task. We implement Merak for PyTorch with the NCCL communication
library and evaluate its performance on NVIDIA Ampere A100 clusters. Extensive experiments on various DNN models show that the average accuracy of Merak's predictions reaches up to 98.25%.
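
The abstract does not reproduce Merak's actual cost formulation. As an illustrative sketch only, a conventional alpha-beta-gamma model for ring all-reduce over p devices with message size n bytes, link bandwidth B, per-message latency alpha, and per-byte reduction cost gamma separates the two cost terms the abstract names:

```latex
% Illustrative only: not Merak's published formula. Standard ring
% all-reduce cost over p devices, message size n (bytes), bandwidth B,
% per-message latency \alpha, per-byte reduction cost \gamma.
\[
  T_{\text{allreduce}}(n, p)
    = \underbrace{2(p-1)\Bigl(\alpha + \frac{n}{pB}\Bigr)}_{\text{data propagation}}
    + \underbrace{(p-1)\,\frac{n}{p}\,\gamma}_{\text{reduce operation}}
\]
```

The first term counts the 2(p-1) communication steps of the reduce-scatter and all-gather phases, while the second accounts for the elementwise reduction applied during reduce-scatter.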
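Likewise, the abstract does not specify the features or learner behind the interference model. A minimal sketch of the idea, assuming a generic regressor over hypothetical overlap features (all names and values here are illustrative, not Merak's):

```python
# Hypothetical sketch: learn how concurrent all-reduce traffic slows
# down computation kernels. Feature layout and learner are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each sample describes a computation kernel overlapping an all-reduce:
# [compute_flops, compute_solo_ms, allreduce_bytes, allreduce_solo_ms]
X_train = np.array([
    [1.2e9, 0.80, 2.5e7, 1.10],
    [3.4e9, 2.10, 2.5e7, 1.10],
    [1.2e9, 0.80, 1.0e8, 4.30],
])
# Target: the kernel's running time under contention (ms), measured by
# profiling a small-scale deployment.
y_train = np.array([0.95, 2.45, 1.20])

model = GradientBoostingRegressor().fit(X_train, y_train)

# Predict the slowed-down kernel time for an unseen overlap pattern.
print(model.predict([[2.0e9, 1.30, 5.0e7, 2.20]]))
```

A simulator can then sum such contention-adjusted kernel times along the DAG's critical path to estimate the step time of the large-scale task.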