ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari; Jeff Rasley; Olatunji Ruwase; Yuxiong He

May 2020

ArXiv

Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelism have fundamental limitations in fitting these models into limited device memory while maintaining computation, communication, and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size in proportion to the number of devices with sustained high efficiency. Our analysis of memory requirements and communication volume demonstrates that ZeRO has the potential to scale beyond 1 trillion parameters using today’s hardware.

We implement and evaluate ZeRO: it trains large models of over 100B parameters with super-linear speedup on 400 GPUs, achieving throughput of 15 petaflops. This represents an 8x increase in model size and a 10x increase in achievable performance over the state of the art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism, which is harder for scientists to apply. Last but not least, researchers have used the system breakthroughs of ZeRO to create the world’s largest language model (Turing-NLG, 17B parameters) with record-breaking accuracy.
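The memory analysis mentioned in the abstract can be illustrated with a small calculation. The sketch below is not taken from this page: it assumes the accounting used in the ZeRO paper for mixed-precision Adam training, namely 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and K = 12 bytes for optimizer states (fp32 weights, momentum, and variance). The function name `zero_memory_bytes` and the stage labels are hypothetical, and the figures cover model states only, not activations or temporary buffers.

```python
def zero_memory_bytes(num_params, num_devices, k=12):
    """Per-device model-state memory (bytes) under each partitioning scheme.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, k bytes/param for optimizer states.
    """
    psi = float(num_params)
    params = 2 * psi  # fp16 parameters
    grads = 2 * psi   # fp16 gradients
    opt = k * psi     # optimizer states (k = 12 for Adam is an assumption)
    return {
        # Plain data parallelism: every device holds a full replica.
        "baseline": params + grads + opt,
        # Partition optimizer states only.
        "os": params + grads + opt / num_devices,
        # Partition optimizer states and gradients.
        "os+g": params + (grads + opt) / num_devices,
        # Partition optimizer states, gradients, and parameters.
        "os+g+p": (params + grads + opt) / num_devices,
    }


if __name__ == "__main__":
    for stage, nbytes in zero_memory_bytes(7.5e9, 64).items():
        print(f"{stage:>8}: {nbytes / 1e9:.1f} GB per device")
```

Under these assumptions, a 7.5B-parameter model on 64 devices drops from 120 GB of model states per device to roughly 1.9 GB when all three states are partitioned, and a 1-trillion-parameter model spread over 1024 devices needs about 15.6 GB per device, which is the sense in which memory scales with the number of devices.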

Related Tools

DeepSpeed

February 12, 2020

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective. Highlights: 10x larger models, 5x faster training, and minimal code change. DeepSpeed can train DL models with over a hundred billion parameters on the current generation of GPU clusters, while achieving over 5x in system performance compared to the state of the art. Early adopters of DeepSpeed have already produced a language model (LM) with over 17B parameters called Turing-NLG, establishing a new SOTA in the LM category.

ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed

The latest trend in AI is that larger natural language models provide better accuracy; however, larger models are difficult to train because of cost, time, and ease of code integration. With the goal of advancing large model training by improving scale, speed, cost, and usability for model developers across the world, Microsoft made the DeepSpeed library open source in February of 2020.

In this webinar, the DeepSpeed team will discuss what DeepSpeed is, how to use it with your existing PyTorch models, and advancements in the ZeRO optimizer that are central to supporting training of 100–200 billion parameter models and higher. In addition, the team will present deep-dive results on how they were able to obtain the world record for fastest BERT training.

DeepSpeed can efficiently train models with 100–200 billion parameters up to 10 times faster than state-of-the-art via the use of a memory optimization system called ZeRO (Zero Redundancy Optimizer). ZeRO is a parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), one of the largest publicly known language models at 17 billion parameters.
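As a rough illustration of what enabling ZeRO in DeepSpeed involves, the sketch below builds a minimal configuration as a Python dictionary and prints it as JSON. The keys `train_batch_size`, `fp16`, and `zero_optimization` with a `stage` field follow DeepSpeed's documented configuration format, but the accepted values have changed between releases, so treat the specific numbers as assumptions and check the documentation for your version. The helper name `make_deepspeed_config` is hypothetical.

```python
import json


def make_deepspeed_config(zero_stage=1, train_batch_size=32):
    """Build a minimal DeepSpeed-style config dict (illustrative values only)."""
    return {
        # Global batch size across all devices (assumed value).
        "train_batch_size": train_batch_size,
        # Mixed-precision training, which the ZeRO memory analysis assumes.
        "fp16": {"enabled": True},
        # ZeRO partitioning stage; the supported stages depend on the release.
        "zero_optimization": {"stage": zero_stage},
    }


if __name__ == "__main__":
    print(json.dumps(make_deepspeed_config(zero_stage=2), indent=2))
```

A configuration like this is typically saved to a JSON file and supplied to the library when an existing PyTorch model is wrapped with `deepspeed.initialize`; the exact arguments are version dependent and are not shown here.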

DeepSpeed recently obtained the fastest BERT training record of 44 minutes on 1024 NVIDIA V100 GPUs. This is a 34% improvement over the best published result, and it does not come at the cost of excessive hardware resources but is a result of improved software efficiency. DeepSpeed can attain a staggering 64 teraflops of single-GPU performance on an NVIDIA V100 GPU, which is over 50% of the hardware peak.
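The "over 50% of the hardware peak" figure can be checked with simple arithmetic. The sketch below assumes a peak of 125 teraflops, which is NVIDIA's quoted mixed-precision tensor-core peak for the V100 SXM2 and is not stated on this page. It also converts the 15 petaflops on 400 GPUs quoted in the abstract into a per-GPU figure. The function names are hypothetical.

```python
# Assumption: NVIDIA's quoted tensor-core peak for a V100 SXM2, in teraflops.
V100_PEAK_TFLOPS = 125.0


def utilization(achieved_tflops, peak_tflops=V100_PEAK_TFLOPS):
    """Fraction of the hardware peak that the achieved throughput represents."""
    return achieved_tflops / peak_tflops


def per_gpu_tflops(aggregate_pflops, num_gpus):
    """Convert an aggregate petaflops figure into teraflops per GPU."""
    return aggregate_pflops * 1000.0 / num_gpus


if __name__ == "__main__":
    # Single-GPU BERT figure quoted above.
    print(f"BERT kernel utilization: {utilization(64.0):.1%}")
    # Aggregate figure from the abstract: 15 petaflops on 400 GPUs.
    per_gpu = per_gpu_tflops(15.0, 400)
    print(f"100B-scale run: {per_gpu:.1f} TFLOPS per GPU "
          f"({utilization(per_gpu):.0%} of assumed peak)")
```

With the assumed peak, 64 teraflops corresponds to about 51% utilization, and the 400-GPU run works out to 37.5 teraflops per GPU, or about 30% of peak.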

Together, you will explore:

DeepSpeed features, optimizations for speed and scale, and a roadmap for the future
How to use DeepSpeed to train your own model and other popular models like BERT and GPT-2
A deep dive into the technology behind the ZeRO optimizer and upcoming features
How we achieved the world record for BERT training using this technology

DeepSpeed is a group of system researchers and engineers who are enthusiastic about performance optimization of large-scale systems. Presenters in this webinar include: Principal Research Manager Yuxiong He, researcher Samyam Rajbhandari, researcher Jeff Rasley, and researcher Tunji Ruwase.

Resource list:

DeepSpeed Website
DeepSpeed Library (GitHub)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (publication)
ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters (blog)
ZeRO-2 & DeepSpeed: Shattering barriers of deep learning speed & scale (blog)
DeepSpeed Fastest BERT deep dive (blog)
Turing-NLG: A 17-billion-parameter language model by Microsoft (blog)
AI at Scale (Project Page)
ONNX Runtime (GitHub)

*This on-demand webinar features a previously recorded Q&A session and open captioning.

Explore more Microsoft Research webinars: https://aka.ms/msrwebinars