ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed
The latest trend in AI is that larger natural language models provide better accuracy; however, larger models are difficult to train because of their cost, long training times, and the complexity of integrating them into existing code. With the goal of advancing large-model training by improving scale, speed, cost, and usability for model developers across the world, Microsoft open-sourced the DeepSpeed library in February 2020.
In this webinar, the DeepSpeed team will discuss what DeepSpeed is, how to use it with your existing PyTorch models, and advancements in the ZeRO optimizer that are central to supporting the training of models with 100–200 billion parameters and beyond. In addition, the team will present deep-dive results on how they obtained the world record for the fastest BERT training.
DeepSpeed can efficiently train models with 100–200 billion parameters up to 10 times faster than the state of the art via a memory optimization system called ZeRO (Zero Redundancy Optimizer). ZeRO is a parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), one of the largest publicly known language models at 17 billion parameters.
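The core memory saving behind ZeRO can be illustrated with a little arithmetic. The toy sketch below (not DeepSpeed code; the function name and the 12-bytes-per-parameter Adam estimate are illustrative assumptions) shows the stage-1 idea: instead of every data-parallel rank replicating the full optimizer state, each rank keeps only its 1/N shard, so per-GPU optimizer memory shrinks linearly with the number of ranks.

```python
# Toy illustration of the ZeRO stage-1 idea (a sketch, not the real DeepSpeed
# implementation). Adam in mixed precision keeps roughly 12 bytes of optimizer
# state per parameter (fp32 master weight + momentum + variance).

def optimizer_state_bytes(num_params, ranks=1, zero_stage1=False,
                          bytes_per_param=12):
    """Optimizer-state bytes held by a single data-parallel rank."""
    total = num_params * bytes_per_param
    if zero_stage1:
        # ZeRO-1: shard the optimizer states evenly across the ranks.
        return total // ranks
    # Plain data parallelism: every rank replicates the full state.
    return total

params = 1_000_000_000  # a 1-billion-parameter model
replicated = optimizer_state_bytes(params, ranks=64)
sharded = optimizer_state_bytes(params, ranks=64, zero_stage1=True)
print(f"replicated: {replicated / 2**30:.1f} GiB per GPU")
print(f"ZeRO-1:     {sharded / 2**30:.2f} GiB per GPU")
```

For a 1-billion-parameter model on 64 GPUs, replicated optimizer state costs about 11 GiB on every GPU, while the sharded version needs under 0.2 GiB per GPU, which is how ZeRO frees memory for far larger models.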
DeepSpeed recently obtained the fastest BERT training record: 44 minutes on 1,024 NVIDIA V100 GPUs. This is a 34% improvement over the best published result, and it comes not from excessive hardware resources but from improved software efficiency. DeepSpeed attains a staggering 64 teraflops of single-GPU performance on an NVIDIA V100 GPU, over 50% of the hardware peak.
Together, you will explore:
- DeepSpeed features, optimizations for speed and scale, and a roadmap for the future
- How to use DeepSpeed to train your own model and other popular models like BERT and GPT-2
- A deep dive into the technology behind the ZeRO optimizer and upcoming features
- How we achieved the world record for BERT training using this technology
The DeepSpeed team is a group of system researchers and engineers who are enthusiastic about performance optimization of large-scale systems. Presenters in this webinar include: Principal Research Manager Yuxiong He, researcher Samyam Rajbhandari, researcher Jeff Rasley, and researcher Tunji Ruwase.
- DeepSpeed Website
- DeepSpeed Library (GitHub)
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (publication)
- ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters (blog)
- ZeRO-2 & DeepSpeed: Shattering barriers of deep learning speed & scale (blog)
- DeepSpeed Fastest BERT deep dive (blog)
- Turing-NLG: A 17-billion-parameter language model by Microsoft (blog)
- AI at Scale (Project Page)
- ONNX runtime (GitHub)
*This on-demand webinar features a previously recorded Q&A session and open captioning.
Explore more Microsoft Research webinars: https://aka.ms/msrwebinars
- Yuxiong He, Samyam Rajbhandari, Jeff Rasley, Tunji Ruwase
- Microsoft Research