DeepSpeed ZeRO++: A leap in speed for LLM and chat model training with 4X less communication


By , Corporate Vice President of Engineering

Figure 1: ZeRO++ project highlights. The top-left subfigure shows that ZeRO++ reduces communication volume by 4x compared with ZeRO stage 3. The top-right subfigure shows ZeRO++ performance on RLHF model training, where ZeRO++ achieves a 1.3x speedup for RLHF training and a more than 2x speedup for token generation.

Large AI models are transforming the digital world. Generative language models like Turing-NLG, ChatGPT, and GPT-4, powered by large language models (LLMs), are incredibly versatile, capable of performing tasks like summarization, coding, and translation. Similarly, large multimodal generative models like DALL·E, Microsoft Designer, and Bing Image Creator can generate art, architecture, videos, and other digital assets, empowering content creators, architects, and engineers to explore new frontiers of creative productivity.

However, training these large models requires considerable memory and computing resources across hundreds or even thousands of GPU devices. For instance, training the Megatron-Turing NLG 530B model used over 4,000 NVIDIA A100 GPUs. Efficiently leveraging these resources requires a complex system of optimizations to partition the models into pieces that fit into the memory of individual devices, and to efficiently parallelize the computing across these devices. At the same time, to make large model training easily accessible to the deep learning community, these optimizations must be easy to use.

The ZeRO family of optimizations from DeepSpeed offers a powerful solution to these challenges and has been widely used to train large and powerful deep learning models such as TNLG-17B, Bloom-176B, MPT-7B, and Jurassic-1. Despite its transformative capabilities, there are critical scenarios where ZeRO incurs high data transfer overhead across GPUs, making it challenging to achieve high training efficiency. This happens specifically when a) training on a large number of GPUs relative to the global batch size, which results in a small per-GPU batch size and requires frequent communication, or b) training on low-end clusters, where cross-node network bandwidth is limited, resulting in high communication latency. In these scenarios, ZeRO’s ability to offer accessible and efficient training is limited.

To address these limitations, we are releasing ZeRO++, a system of communication optimization strategies built on top of ZeRO to offer unmatched efficiency for large model training, regardless of batch size limitations or cross-device bandwidth constraints. ZeRO++ leverages quantization, in combination with data and communication remapping, to reduce total communication volume by 4x compared with ZeRO, without impacting model quality. This has two key implications:

  • ZeRO++ accelerates large model pre-training and fine-tuning
    • Small batch-size per GPU: Whether pre-training large models on thousands of GPUs or fine-tuning them on hundreds or even dozens of GPUs, when batch-size per GPU is small, ZeRO++ offers up to 2.2x higher throughput compared to ZeRO, directly reducing training time and cost.
    • Low-bandwidth clusters: ZeRO++ enables low-bandwidth clusters to achieve throughput similar to that of clusters with 4x higher bandwidth. Therefore, ZeRO++ makes efficient large model training accessible across a wider variety of clusters.
  • ZeRO++ accelerates ChatGPT-like model training with RLHF

    While ZeRO++ was designed primarily for training, its optimizations also apply automatically to ZeRO-Inference, as the communication overheads are common to training and inference with ZeRO. Consequently, ZeRO++ improves the efficiency of workloads like reinforcement learning from human feedback (RLHF) used in training dialogue models, which combines both training and inference.

    Through integration with DeepSpeed-Chat, ZeRO++ can improve the generation phase of RLHF training by up to 2x and the reinforcement learning training phase by up to 1.3x compared to the original ZeRO.

Next, we’ll take a deeper dive into ZeRO and its communication overheads and discuss the key optimizations in ZeRO++ for addressing them. Then we’ll demonstrate the impact of ZeRO++ on training throughput for different model sizes, batch sizes, and bandwidth constraints. We’ll also discuss how ZeRO++ applies to DeepSpeed-Chat for accelerating the training of dialogue models using RLHF.

Deep dive into ZeRO++

Figure 2: ZeRO optimizer workflow

ZeRO is a memory-efficient variation of data parallelism in which model states are partitioned across all the GPUs instead of being replicated, and are reconstructed on the fly during training using gather/broadcast-based communication collectives. This allows ZeRO to effectively leverage the aggregate GPU memory and compute across all devices, while offering the simplicity and ease of use of data-parallel training.

Let the model size be M. During the forward pass, ZeRO conducts all-gather/broadcast operations to collect the parameters of each model layer right before they are needed (a total of size M). In the backward pass, ZeRO adopts a similar communication pattern for the parameters of each layer to compute its local gradients (a total of size M). In addition, ZeRO averages and partitions each local gradient immediately after it is computed using a reduce or reduce-scatter communication collective (a total of size M). In total, ZeRO has a communication volume of 3M, spread evenly across two all-gather/broadcast operations and one reduce-scatter/reduce operation.

To reduce these communication overheads, ZeRO++ has three sets of communication optimizations, targeting each of the above-mentioned three communication collectives, respectively:

Figure 3: Block-based quantization in qwZ. The figure shows that block-based quantization preserves better data precision than basic quantization.

Quantized weight communication for ZeRO (qwZ)

First, to reduce parameter communication volume during all-gather, we quantize weights on the fly, shrinking each model parameter from FP16 (two bytes) to INT8 (one byte) before communication and dequantizing the weights afterward. However, naively quantizing weights may reduce model training accuracy. To preserve training precision, we adopt block-based quantization, which quantizes each subset of model parameters independently. There was no existing high-performance implementation of block-based quantization, so we implemented highly optimized quantization CUDA kernels from scratch that are 3x more accurate and 5x faster than basic quantization.
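To make the difference between basic and block-based quantization concrete, here is a minimal pure-Python sketch. It is not the ZeRO++ CUDA kernel: the symmetric scaling scheme, the block size of 64, and the helper names are all assumptions chosen for illustration. It only shows why one scale per block tends to preserve more precision than one scale for the whole tensor when a few outliers are present.

```python
import random

def quantize_blocks(values, block_size, bits=8):
    """Symmetric quantization with one scale per block of `block_size` values."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # One scale per block; fall back to 1.0 for an all-zero block.
        scale = max(abs(v) for v in block) / qmax or 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in block)
    return codes, scales

def dequantize_blocks(codes, scales, block_size):
    """Map integer codes back to floats using the scale of their block."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical weight tensor: small values plus a few large outliers.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
for i in range(0, len(weights), 512):
    weights[i] = 1.0

# "Basic" quantization: a single block (one scale) for the whole tensor.
basic_codes, basic_scales = quantize_blocks(weights, block_size=len(weights))
err_basic = mean_abs_error(
    dequantize_blocks(basic_codes, basic_scales, len(weights)), weights)

# Block-based quantization: an independent scale for every 64 values.
block_codes, block_scales = quantize_blocks(weights, block_size=64)
err_block = mean_abs_error(
    dequantize_blocks(block_codes, block_scales, 64), weights)

print(f"basic error: {err_basic:.6f}, block-based error: {err_block:.6f}")
```

With a single scale, the outliers stretch the quantization grid for every value; with per-block scales, only the blocks that contain an outlier pay that cost.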

Figure 4: Hierarchical weight partition in hpZ. The figure shows that hpZ holds secondary model partitions on each GPU, whereas ZeRO-3 holds only primary model partitions.

Hierarchical weight partition for ZeRO (hpZ)

Second, to reduce the communication overhead of all-gather on weights during the backward pass, we trade GPU memory for communication. More specifically, instead of spreading the whole model's weights across all the machines as in ZeRO, we maintain a full model copy within each machine. At the expense of higher memory overhead, this allows us to replace the expensive cross-machine all-gather/broadcast on weights with an intra-machine all-gather/broadcast, which is substantially faster due to much higher intra-machine communication bandwidth.
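The memory-for-communication trade can be illustrated with a little arithmetic. The sketch below is only a rough model under stated assumptions: it counts weight parameters (not bytes, gradients, or optimizer state), assumes each GPU keeps its primary ZeRO-3 shard plus a secondary shard sized so that one node holds a full copy, and uses a hypothetical cluster shape of 24 nodes with 16 GPUs each. The function names are illustrative and not part of DeepSpeed.

```python
def weights_held_per_gpu(model_params, num_nodes, gpus_per_node, use_hpz):
    """Weight parameters stored on each GPU (ignores gradients and optimizer state)."""
    world_size = num_nodes * gpus_per_node
    primary = model_params / world_size            # ZeRO-3 shard across all GPUs
    # Assumption: hpZ adds a secondary shard so each node holds one full model copy.
    secondary = model_params / gpus_per_node if use_hpz else 0.0
    return primary + secondary

def backward_allgather_scope(use_hpz):
    """Where the backward all-gather on weights has to travel."""
    return "intra-node" if use_hpz else "cross-node"

# Hypothetical example: 18e9-parameter model on 24 nodes x 16 GPUs = 384 GPUs.
zero3 = weights_held_per_gpu(18e9, 24, 16, use_hpz=False)
hpz = weights_held_per_gpu(18e9, 24, 16, use_hpz=True)

print(f"ZeRO-3: {zero3:,.0f} params per GPU, {backward_allgather_scope(False)} all-gather")
print(f"hpZ:    {hpz:,.0f} params per GPU, {backward_allgather_scope(True)} all-gather")
```

In this toy setting the extra memory is one node-local shard per GPU, in exchange for keeping the backward all-gather off the cross-node network.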

Figure 5: End-to-end workflow of qgZ. This animated figure shows the whole workflow of the qgZ component: tensor slice reordering, intra-node quantization, intra-node all-to-all communication, intra-node dequantization, intra-node reduction, inter-node quantization, inter-node all-to-all communication, inter-node dequantization, and inter-node reduction.

Quantized gradient communication for ZeRO (qgZ)

Third, reducing the communication cost of gradients in reduce-scatter is even more challenging. Directly applying quantization to reduce communication volume is infeasible: even with block-based quantization, reducing gradients in low precision accumulates and amplifies quantization error. To address this, we quantize gradients only for communication and dequantize them to full precision before any reduction operation. To do this efficiently, we invented a novel all-to-all-based quantized gradient communication paradigm called qgZ, which is functionally equivalent to a compressed reduce-scatter collective operation.

qgZ is designed to solve two challenges: i) overcoming the significant accuracy loss that would result from low-precision reduction if we simply implemented reduce-scatter in INT4/INT8, and ii) avoiding the accuracy degradation and significant latency overhead that would result from the long sequence of quantization and dequantization steps required by traditional ring- or tree-based reduce-scatter, even if the reductions were done in full precision. Instead of using a ring- or tree-based reduce-scatter algorithm, qgZ is based on a novel hierarchical all-to-all approach.

There are three major steps in qgZ: i) gradient slice reordering, ii) intra-node communication and reduction, and iii) inter-node communication and reduction. First, before any communication happens, we slice the gradient and reorder the tensor slices to guarantee that the final gradient placement (i.e., the green chunks in Figure 5) is correct on each GPU at the end of the communication. Second, we quantize the reordered gradient slices, conduct all-to-all communication within each node, dequantize the gradient slices received from the all-to-all, and do local reductions. Third, we quantize the locally reduced gradients again, conduct inter-node all-to-all communication, dequantize the received gradients again, and compute the final high-precision gradient reduction to get the results shown as green chunks in Figure 5.
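The three steps above can be sketched as a single-process simulation. This is a toy model, not the DeepSpeed implementation: ranks are plain Python lists, each "send" is modeled as quantize followed by dequantize, each slice is quantized as one block, and the function names and the reordering index formula are assumptions worked out for this example. It checks that every rank ends up with (approximately) the sum of its own slice across all ranks.

```python
import random

def quantize(block, bits):
    """Symmetric quantization of one slice with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in block) / qmax or 1.0
    return [round(v / scale) for v in block], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def send(block, bits):
    """Model one hop: quantize on the sender, dequantize on the receiver."""
    return dequantize(*quantize(block, bits))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def qgz_reduce_scatter(grads, num_nodes, gpus_per_node, bits=8):
    """grads[r] is rank r's full gradient; returns each rank's reduced slice."""
    X, N = num_nodes, gpus_per_node
    world = X * N
    slice_len = len(grads[0]) // world
    zero = [0.0] * slice_len

    # Step i: slice and reorder so that position j*X + k holds the slice
    # owned by global rank k*N + j (node k, local GPU j).
    reordered = []
    for g in grads:
        slices = [g[s * slice_len:(s + 1) * slice_len] for s in range(world)]
        reordered.append([slices[k * N + j] for j in range(N) for k in range(X)])

    # Step ii: intra-node all-to-all, then reduction in full precision.
    partial = {}
    for n in range(X):
        for j in range(N):
            sums = [zero] * X
            for i in range(N):
                src = reordered[n * N + i]
                sums = [add(sums[k], send(src[j * X + k], bits)) for k in range(X)]
            partial[(n, j)] = sums

    # Step iii: inter-node all-to-all between GPUs with the same local index,
    # again reducing in full precision after dequantization.
    result = []
    for k in range(X):
        for j in range(N):
            total = zero
            for n in range(X):
                total = add(total, send(partial[(n, j)][k], bits))
            result.append(total)                 # index k*N + j == global rank
    return result

random.seed(0)
X, N, slice_len = 2, 2, 4
grads = [[random.uniform(-1, 1) for _ in range(X * N * slice_len)]
         for _ in range(X * N)]
out = qgz_reduce_scatter(grads, X, N, bits=8)
exact = [[sum(g[r * slice_len + t] for g in grads) for t in range(slice_len)]
         for r in range(X * N)]
max_err = max(abs(a - b) for o, e in zip(out, exact) for a, b in zip(o, e))
print(f"max abs error vs. exact reduce-scatter: {max_err:.5f}")
```

Note that each value is quantized only twice (once per hop) regardless of the number of GPUs, and every addition happens on dequantized values.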

The reason for this hierarchical approach is to reduce cross-node communication volume. More precisely, given N GPUs per node, a model size of M, and a quantization ratio of Z, a single-hop all-to-all generates M*N/Z cross-node traffic per node, because each of the N GPUs sends M/Z across nodes. With the hierarchical approach, the intra-node reduction happens first, so the cross-node traffic of each GPU drops from M/Z to M/(Z*N). Thus, the total cross-node communication volume is reduced from M*N/Z to M*N/(Z*N) = M/Z. We further optimize the end-to-end latency of qgZ by overlapping intra-node and inter-node communication and by fusing the CUDA kernels for (tensor slice reordering + intra-node quantization) and (intra-node dequantization + intra-node reduction + inter-node quantization).
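As a quick check of this arithmetic, the following sketch evaluates both expressions. It assumes the totals are counted per node and uses hypothetical values (M normalized to 1, N = 8 GPUs per node, Z = 4); the function name is illustrative.

```python
def cross_node_traffic(model_size, gpus_per_node, quant_ratio):
    """Cross-node gradient traffic for single-hop vs. hierarchical all-to-all."""
    single_hop_per_gpu = model_size / quant_ratio
    hierarchical_per_gpu = model_size / (quant_ratio * gpus_per_node)
    return {
        "single_hop_per_gpu": single_hop_per_gpu,
        "single_hop_per_node": single_hop_per_gpu * gpus_per_node,      # M*N/Z
        "hierarchical_per_gpu": hierarchical_per_gpu,
        "hierarchical_per_node": hierarchical_per_gpu * gpus_per_node,  # M/Z
    }

# Hypothetical values: M = 1 (normalized), N = 8 GPUs per node, Z = 4.
traffic = cross_node_traffic(model_size=1.0, gpus_per_node=8, quant_ratio=4)
for name, volume in traffic.items():
    print(f"{name}: {volume}")
```

The hierarchical variant cuts cross-node traffic by a factor of N, the number of GPUs per node.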

Communication volume    Forward all-gather on weights    Backward all-gather on weights    Backward reduce-scatter on gradients    Total
ZeRO                    M                                M                                 M                                       3M
ZeRO++                  0.5M                             0                                 0.25M                                   0.75M

Communication volume reduction

By incorporating all three components above, we reduce the cross-node communication volume from 3M down to 0.75M. More specifically, we reduce forward all-gather/broadcast on model weights from M to 0.5M using qwZ. We eliminate the cross-node all-gather during backward propagation using hpZ, reducing the communication from M to 0. Finally, we reduce cross-node reduce-scatter communication during backward-pass from M to 0.25M using qgZ.
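The totals can be reproduced from bit widths with a few lines of arithmetic. This sketch assumes FP16 as the baseline, INT8 for qwZ weights (as stated earlier), and INT4 for qgZ gradients, which is what the 0.25M figure implies; volumes are expressed as multiples of M and the dictionary keys are only labels.

```python
FP16_BITS = 16

# Cross-node volume per collective, as a multiple of the model size M.
zero_volume = {
    "forward all-gather on weights": 1.0,
    "backward all-gather on weights": 1.0,
    "backward reduce-scatter on gradients": 1.0,
}
zeropp_volume = {
    "forward all-gather on weights": 8 / FP16_BITS,         # qwZ: FP16 -> INT8
    "backward all-gather on weights": 0.0,                  # hpZ: stays inside the node
    "backward reduce-scatter on gradients": 4 / FP16_BITS,  # qgZ: assumed INT4
}

total_zero = sum(zero_volume.values())
total_zeropp = sum(zeropp_volume.values())
reduction = total_zero / total_zeropp

print(f"ZeRO: {total_zero}M, ZeRO++: {total_zeropp}M, reduction: {reduction}x")
```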

ZeRO++ accelerates LLM training

Here we show our evaluation results for ZeRO++ in real-world LLM training scenarios on 384 NVIDIA V100 GPUs.

Figure 6: Throughput comparison of ZeRO++ vs. ZeRO with a 400Gbps interconnect. The figure shows that ZeRO++ can achieve up to 1.56x speedup with 1k tokens per GPU, and 1.41x speedup with 2k tokens per GPU.

High efficiency with small batch per-GPU

High-bandwidth cluster: Figure 6 shows the throughput improvement of ZeRO++ over ZeRO for different model sizes and micro-batch sizes with 400Gbps cross-node interconnects using 4x InfiniBand (IB) links, each running at 100Gbps. With 1k tokens per GPU, ZeRO++ achieves 28% to 36% throughput improvement over ZeRO-3. With 2k tokens per GPU, ZeRO++ achieves 24% to 29% throughput gain over ZeRO-3.

Figure 7: Throughput comparison of ZeRO++ vs. ZeRO with a 100Gbps interconnect. The figure shows that ZeRO++ achieves 2.21x speedup over ZeRO with 1k tokens per GPU, and 1.77x speedup with 2k tokens per GPU.

Low-bandwidth cluster: In low-bandwidth network environments, such as a 100Gbps network, ZeRO++ performs significantly better than ZeRO-3. As shown in Figure 7, ZeRO++ achieves up to 2.2x speedup in end-to-end throughput compared to ZeRO-3. On average, ZeRO++ achieves around 2x speedup over the ZeRO-3 baseline.

Figure 8: ZeRO++ with a low-bandwidth interconnect achieves throughput similar to ZeRO with a high-bandwidth interconnect. The figure shows that for both the 18B and 138B model sizes, ZeRO++ on a low-bandwidth network achieves throughput similar to that of ZeRO on a high-bandwidth interconnect.

Enabling efficiency equivalence between high and low bandwidth clusters

In addition, ZeRO++ in a low-bandwidth cluster can achieve system throughput comparable to that of ZeRO in a much higher-bandwidth setting. As shown in Figure 8, for both the 18B and 138B models, ZeRO++ with a 200Gbps cross-node link reaches TFLOPs similar to those of ZeRO-3 with an 800Gbps cross-node link.

Given the excellent scalability of ZeRO++, we envision ZeRO++ as the next generation of ZeRO for training large AI models.

ZeRO++ for RLHF training with DeepSpeed-Chat

RLHF training background

ChatGPT-like models are powered by LLMs and fine-tuned using RLHF. RLHF consists of generation (inference) phases and training phases. During the generation phase, the actor model takes a partial conversation as input and generates responses using a sequence of forward passes. Then, during the training phase, the critic model ranks the generated responses by quality, providing reinforcement signals for the actor model. The actor model is fine-tuned using these rankings, enabling it to generate more accurate and appropriate responses in subsequent iterations.

RLHF training brings a non-trivial amount of memory pressure as it utilizes four models (actor, reference, critic, reward). Low-rank adaptation (LoRA) is employed to address the memory pressure of RLHF. LoRA freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, significantly reducing the number of trainable parameters. LoRA speeds up RLHF by reducing memory usage, allowing for larger batch sizes, and thus greatly improves throughput.
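For readers unfamiliar with LoRA, here is a minimal pure-Python sketch of the idea described above: the pretrained weight matrix stays frozen and only two small rank-r matrices are trained. The function names, the alpha/r scaling convention, and the 4096 x 4096 layer size are assumptions for illustration, not details taken from DeepSpeed-Chat.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, a, b, alpha):
    """y = x W + (alpha / r) * x A B, where only A and B are trainable."""
    r = len(b)                                   # rank = number of rows of B
    base = matmul(x, w_frozen)                   # frozen pretrained path
    update = matmul(matmul(x, a), b)             # low-rank trainable path
    return [[p + (alpha / r) * q for p, q in zip(base_row, upd_row)]
            for base_row, upd_row in zip(base, update)]

def trainable_params(d_in, d_out, r):
    """Parameters in A (d_in x r) plus B (r x d_out)."""
    return d_in * r + r * d_out

# Hypothetical layer: 4096 x 4096 weight matrix with rank-8 adapters.
full = 4096 * 4096
lora = trainable_params(4096, 4096, r=8)
print(f"full: {full:,} params, LoRA: {lora:,} trainable params ({full // lora}x fewer)")

# With B initialized to zeros, the adapted layer starts out identical to the frozen one.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
y = lora_forward(x, w, a=[[1.0], [1.0]], b=[[0.0, 0.0]], alpha=1.0)
print(y)
```

Because the large matrix never changes, it can be treated as read-only data during training, which is what makes the frozen-weight handling discussed next possible.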

DeepSpeed-Chat with ZeRO++ for RLHF training

Figure 9: ZeRO++ speedup in RLHF training. The left figure shows that ZeRO++ achieves 1.26x speedup for RLHF step 1 training. The right figure shows that ZeRO++ achieves up to 2.25x speedup in RLHF step 3 token generation.

RLHF with LoRA is a unique application for ZeRO++ since most model weights are frozen. This means ZeRO++ can keep these frozen weights quantized in INT4/8 instead of storing them in FP16 and quantizing them before each communication operation. The dequantization after communication is still done to get the weights ready for computation, but the dequantized weights are simply discarded after computation.  

Using ZeRO++ for RLHF training in this way reduces both memory usage and communication volume. This boosts training throughput by reducing communication as well as by enabling larger batch sizes due to reduced memory usage. During the generation phase, ZeRO++ uses hpZ to keep all weight communication within each node, taking advantage of the higher intra-node communication bandwidth with reduced communication volume and further improving the generation throughput.

ZeRO++ is integrated into DeepSpeed-Chat to power RLHF training of ChatGPT-like models. In Figure 9, we compare the RLHF generation throughput of ZeRO and ZeRO++ for 30B and 66B actor models on 32 V100 GPUs. The results show that ZeRO++ enables up to 2.25x better RLHF generation throughput than ZeRO. We also present the speedup for the training phase on 16 V100 GPUs, where ZeRO++ achieves 1.26x better throughput than ZeRO as a result of the lower communication and larger batch sizes enabled by ZeRO++.

Release: Try DeepSpeed ZeRO++ today

We are excited to release DeepSpeed ZeRO++ and make it available to anyone in the AI community. To get started, please visit our GitHub page for LLM training. ZeRO++ for DeepSpeed-Chat will be released in the coming weeks.
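As a rough starting point, the three optimizations are typically switched on through the `zero_optimization` section of a DeepSpeed configuration. The sketch below builds such a configuration as a Python dictionary. The key names `zero_quantized_weights`, `zero_hpz_partition_size`, and `zero_quantized_gradients` are not stated in this post; they reflect our understanding of the DeepSpeed ZeRO++ tutorial and should be verified against the current DeepSpeed documentation. The value of 8 GPUs per node is hypothetical.

```python
import json

GPUS_PER_NODE = 8  # hypothetical; set this to the number of GPUs in one node

# Assumed key names; check the DeepSpeed configuration docs before use.
ds_config = {
    "zero_optimization": {
        "stage": 3,                                # ZeRO++ builds on ZeRO stage 3
        "zero_quantized_weights": True,            # qwZ
        "zero_hpz_partition_size": GPUS_PER_NODE,  # hpZ: secondary partition per node
        "zero_quantized_gradients": True,          # qgZ
    }
}

# Write the result out as JSON text, the format DeepSpeed config files use.
print(json.dumps(ds_config, indent=2))
```

Each optimization is controlled independently, so in principle they can be enabled one at a time to measure their individual effect.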

DeepSpeed ZeRO++ is part of the DeepSpeed ecosystem. To learn more, please visit our website, where you’ll find detailed blog posts, tutorials, and helpful documentation.

For the latest DeepSpeed news, please follow us on social media:

DeepSpeed welcomes your contributions. We encourage you to report issues, contribute PRs, and join discussions on the DeepSpeed GitHub page. Please see our contributing guide for more details. We are open to collaborations with universities, research labs, and companies. For such requests (and other requests unsuitable for GitHub), please directly email to


This project was made possible by the contributions of the following people from the DeepSpeed Team:

Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Ammar Ahmad Awan, Jeff Rasley, Michael Wyatt, Yuxiong He (team lead)
