We design and implement a performance simulator for DNN models. It converts the
input training script into a directed acyclic graph (DAG) and emulates the execution of
its operations. The simulator supports both data parallelism and pipeline parallelism.
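The core idea of replaying a DAG of operations can be sketched as a critical-path computation: each op starts once all of its predecessors finish, and step time is the latest finish time. This is a minimal illustration, not the simulator's actual implementation; the op names and durations are invented for the example.

```python
def simulate(ops, deps, durations):
    """Estimate step time by replaying a DAG of ops in topological order.

    ops: list of op names, already topologically sorted
    deps: dict mapping op -> list of predecessor ops
    durations: dict mapping op -> estimated execution time (ms)
    """
    finish = {}
    for op in ops:
        # an op starts when its last dependency finishes
        start = max((finish[d] for d in deps.get(op, [])), default=0.0)
        finish[op] = start + durations[op]
    return max(finish.values())

# Toy graph: two parallel branches joining at a matmul
ops = ["load", "conv", "relu", "matmul"]
deps = {"conv": ["load"], "relu": ["load"], "matmul": ["conv", "relu"]}
durations = {"load": 1.0, "conv": 4.0, "relu": 0.5, "matmul": 2.0}
step_time = simulate(ops, deps, durations)  # critical path: load -> conv -> matmul
```

A full simulator would additionally model device placement and queueing, but the finish-time recurrence above is the backbone of DAG-based emulation.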
We provide an in-depth analysis of the NCCL allreduce implementation. By characterizing its behavior, we introduce a network latency predictor that gives our simulator reliable estimates of communication time.
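As background for why allreduce time is predictable from message size and GPU count, a standard latency-bandwidth (alpha-beta) model of ring allreduce is sketched below. This is the textbook cost model, not the predictor characterized in this work; the alpha and beta constants are placeholder assumptions.

```python
def ring_allreduce_time(msg_bytes, num_gpus, alpha, beta):
    """Alpha-beta cost model for ring allreduce.

    A ring allreduce over p ranks takes 2*(p-1) steps (reduce-scatter
    plus all-gather), each step moving msg_bytes/p bytes.
    alpha: per-step latency (s); beta: inverse bandwidth (s/byte).
    """
    p = num_gpus
    steps = 2 * (p - 1)
    return steps * alpha + steps * (msg_bytes / p) * beta

# Assumed example: 100 MB of gradients across 8 GPUs,
# 10 us per-step latency, 10 GB/s per-link bandwidth
t = ring_allreduce_time(100e6, 8, 10e-6, 1 / 10e9)
```

A predictor fit to measured NCCL traces refines this shape with empirically observed constants rather than nominal link speeds.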
We consider the interference between computation and communication operations
in distributed training. Based on empirical evidence, we propose a mathematical
formulation that quantifies the impact of this interference on training step time during simulation.
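One simple way such a formulation can enter a simulator is as a multiplicative slowdown applied to the window where computation and communication overlap. The sketch below is a hypothetical illustration of that idea with an assumed constant slowdown factor, not the formulation proposed in this work.

```python
def overlapped_time(comp_time, comm_time, slowdown):
    """Total wall time when a compute op and a communication op overlap.

    The overlapping window is stretched by a slowdown factor (>= 1.0)
    to model contention; the longer op then runs alone at full speed.
    """
    overlap = min(comp_time, comm_time)
    tail = abs(comp_time - comm_time)
    return overlap * slowdown + tail

# Assumed example: 4 ms of compute overlapping 3 ms of allreduce,
# with 25% mutual slowdown during the overlap
t = overlapped_time(4.0, 3.0, 1.25)
```

In practice the slowdown factor would itself be a function of the operations involved, calibrated from measurements.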
We collect running-time traces of operations and investigate the intra-GPU and inter-GPU variance in execution time. We adopt a statistical approach to model this variance and incorporate it into our simulator.
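A minimal version of the statistical approach is to fit a distribution to each op's measured durations and draw from it during simulation. The sketch below assumes a normal distribution for illustration; the trace values and the distribution choice are assumptions, not the fitted model from this work.

```python
import random
import statistics

def fit_op_time(trace):
    """Fit a normal distribution to measured durations of one op."""
    mu = statistics.mean(trace)
    sigma = statistics.stdev(trace)
    return mu, sigma

def sample_op_time(mu, sigma, rng):
    """Draw one simulated duration; clamp to avoid negative times."""
    return max(rng.gauss(mu, sigma), 0.0)

rng = random.Random(0)
trace = [1.9, 2.1, 2.0, 2.2, 1.8]  # assumed measured runs of one kernel (ms)
mu, sigma = fit_op_time(trace)
samples = [sample_op_time(mu, sigma, rng) for _ in range(1000)]
```

Replacing each fixed duration in the DAG replay with a draw from the fitted distribution turns a single estimate into a Monte Carlo distribution of step times.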