Project Fiddle: Fast and Efficient Infrastructure for Distributed Deep Learning
The goal of Project Fiddle is to build efficient systems infrastructure for very fast distributed DNN training, with the aim of making training 100x more efficient. To achieve this, we take a broad view of training: from a single GPU, to multiple GPUs on a machine, all the way to training on large multi-machine clusters. Our innovations cut across the systems stack: the memory subsystem, the structuring of parallel computation across GPUs and machines, and the interconnects between GPUs and across machines.
Our work so far has targeted many different parts of the systems stack, organized as separate sub-projects:
- Gist: In Gist, we ask: how far can we push the limits of single-GPU training? Specifically, we explore training larger networks on a single GPU by slashing the memory footprint of training.
- PipeDream: Unlike other big-data workloads, DNN training is not naïvely parallelizable because one has to strike a balance between hardware and statistical efficiency. Time to achieve the desired accuracy is what matters. We have designed a new way to systematically parallelize DNN computation to efficiently scale training by combining model parallelism, data parallelism, and pipelining.
- Blink: There are exciting advances in inter-GPU interconnects, such as NVLink within a single machine and InfiniBand with GPUDirect RDMA across machines. But these advances also bring heterogeneity, which is a challenge for developers of data-transfer protocols. Blink is a library that speeds up inter-GPU communication in parallel training; it shields developers from interconnect heterogeneity while generating transfer schedules that maximize the utilization of interconnect links.
- Fast, Fault-tolerant, and Elastic Training in Multi-tenant Clusters.
Our work in Fiddle is grounded in rigorous profiling and benchmarking, and we build helpful tools along the way. Our profiling work spans single-GPU training (TBD, the Training Benchmark for DNNs), multi-GPU training, and cluster-wide profiling and characterization across multiple jobs.
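To make the trade-off in the PipeDream item above concrete, here is a minimal sketch of why time to accuracy, rather than raw throughput, is the metric to optimize. It assumes the simple model that time to accuracy is the time per epoch (hardware efficiency) multiplied by the number of epochs needed to reach the target accuracy (statistical efficiency). The class, the function names, and all numbers are hypothetical and chosen only for illustration; they are not measurements from Fiddle or PipeDream.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Hypothetical training setup; every number here is illustrative, not measured."""
    name: str
    seconds_per_epoch: float  # hardware efficiency: lower is better
    epochs_to_target: int     # statistical efficiency: lower is better


def time_to_accuracy(cfg: TrainingConfig) -> float:
    """Total seconds to reach the target accuracy under the simple product model."""
    return cfg.seconds_per_epoch * cfg.epochs_to_target


def best_config(configs):
    """Pick the configuration with the lowest time to accuracy."""
    return min(configs, key=time_to_accuracy)


configs = [
    # Slow per epoch, but converges in few epochs.
    TrainingConfig("single_gpu", seconds_per_epoch=1000.0, epochs_to_target=10),
    # Fastest per epoch, but needs many more epochs to converge.
    TrainingConfig("fast_but_inefficient", seconds_per_epoch=200.0, epochs_to_target=60),
    # Moderately fast per epoch with only a small convergence penalty.
    TrainingConfig("balanced", seconds_per_epoch=300.0, epochs_to_target=12),
]

for cfg in configs:
    print(f"{cfg.name}: {time_to_accuracy(cfg):.0f} s to target accuracy")
print("best:", best_config(configs).name)
```

In this made-up example the configuration with the fastest epochs is the slowest overall, because it needs so many more epochs to converge. That is the balance between hardware and statistical efficiency that a parallelization strategy has to strike.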