{"id":738250,"date":"2021-04-19T09:08:22","date_gmt":"2021-04-19T16:08:22","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=738250"},"modified":"2022-05-19T11:43:03","modified_gmt":"2022-05-19T18:43:03","slug":"zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training\/","title":{"rendered":"ZeRO-Infinity and DeepSpeed: Unlocking unprecedented model scale for deep learning training"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-large\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"ZeRO-Infinity obtains excellent training efficiency\u2014over 25 petaflops of sustained performance for multi-billion and multi-trillion parameter models on 512 NVIDIA V100 GPUs. The efficiency at model sizes of 500B is comparable to state-of-the-art 3D parallelism. Unlike ZeRO-Infinity, 3D parallelism cannot scale to models with trillions of parameters due to GPU memory constraint.\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/1400x788_deepspeed_update_figure_nologo_Still-1-scaled.jpg\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/1400x788_deepspeed_update_figure_nologo_Still-1-scaled.jpg\" alt=\"ZeRO-Infinity obtains excellent training efficiency\u2014over 25 petaflops of sustained performance for multi-billion and multi-trillion parameter models on 512 NVIDIA V100 GPUs. The efficiency at model sizes of 500B is comparable to state-of-the-art 3D parallelism. 
Unlike ZeRO-Infinity, 3D parallelism cannot scale to models with trillions of parameters due to GPU memory constraint.\"\/><\/a><\/figure>\n\n\n\n<p>Since the DeepSpeed optimization library was introduced last year, it has rolled out numerous novel optimizations for training large AI models\u2014improving scale, speed, cost, and usability. As large models have quickly evolved over the last year, so too has DeepSpeed. Whether <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters\/\">enabling researchers to create the 17-billion-parameter Microsoft Turing Natural Language Generation (Turing-NLG)<\/a> with state-of-the-art accuracy, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale\/\">achieving the fastest BERT training record<\/a>, or <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/\">supporting 10x larger model training using a single GPU<\/a>, DeepSpeed continues to tackle challenges in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/ai-at-scale\/\">AI at Scale<\/a> with the latest advancements for large-scale model training. Now, the novel memory optimization technology ZeRO (Zero Redundancy Optimizer), included in DeepSpeed, is undergoing a further transformation of its own. The improved ZeRO-Infinity offers the system capability to go beyond the GPU memory wall and train models with tens of trillions of parameters, an order of magnitude bigger than state-of-the-art systems can support. It also offers a promising path toward training 100-trillion-parameter models.<\/p>\n\n\n\n<p><strong>ZeRO-Infinity at a glance: <\/strong>ZeRO-Infinity is a novel deep learning (DL) training technology for scaling model training, from a single GPU to massive supercomputers with thousands of GPUs. 
It powers unprecedented model sizes by leveraging the full memory capacity of a system, concurrently exploiting all heterogeneous memory (GPU, CPU, and Non-Volatile Memory Express, or NVMe for short). Learn more in our paper,\u202f\u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-infinity-breaking-the-gpu-memory-wall-for-extreme-scale-deep-learning\/\" target=\"_blank\" rel=\"noreferrer noopener\">ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning<\/a>.\u201d The highlights of&nbsp;ZeRO-Infinity include:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Offering the system capability to train a model with over 30 trillion parameters on 512 NVIDIA V100 Tensor Core GPUs, 50x larger than state of the art.&nbsp;<\/li><li>Delivering excellent training efficiency and superlinear throughput scaling through novel data partitioning and mapping that can exploit the aggregate CPU\/NVMe memory bandwidths and CPU compute, offering over 25 petaflops of sustained throughput on 512 NVIDIA V100 GPUs.<\/li><li>Furthering the mission of the DeepSpeed team to democratize large model training by allowing data scientists with <em>a single GPU<\/em> to fine-tune models larger than OpenAI GPT-3 (175 billion parameters).<\/li><li>Eliminating the barrier to entry for large model training by making it simpler and easier\u2014ZeRO-Infinity scales beyond a trillion parameters without the complexity of combining several parallelism techniques and without requiring changes to user code. 
To the best of our knowledge, it\u2019s the only parallel technology to do this.<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-video alignwide\"><video autoplay controls loop src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/1400x788_deepspeed_nologo-1.mp4\"><\/video><figcaption>The video above shows how <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/aka.ms\/zero-inf\">ZeRO-Infinity<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> efficiently leverages GPU, CPU, and NVMe together by 1) partitioning each model layer across all data parallel processes, 2) placing the partitions on the corresponding data parallel NVMe devices, and 3) coordinating the data movement needed to compute forward\/backward propagation and weight updates on the data parallel GPUs and CPUs, respectively.<\/figcaption><\/figure>\n\n\n\n<p>We are also pleased to announce&nbsp;DeepSpeed\u2019s&nbsp;integration with&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/azure.microsoft.com\/en-us\/services\/machine-learning\/\" target=\"_blank\" rel=\"noopener noreferrer\">Azure Machine Learning<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;and&nbsp;open-source solutions. The <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/resource-curated-environments#deepspeed\">DeepSpeed curated environment<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> in Azure Machine Learning makes it easier for users to get started on Azure.&nbsp;DeepSpeed&nbsp;is now integrated in&nbsp;<strong>Hugging Face<\/strong>&nbsp;v4.2 and&nbsp;<strong>PyTorch&nbsp;Lightning<\/strong>&nbsp;v1.2. 
Hugging Face and&nbsp;PyTorch&nbsp;Lightning users can easily accelerate their models with&nbsp;DeepSpeed&nbsp;through a simple \u201cdeepspeed\u201d flag!<\/p>\n\n\n\n<h2 id=\"addressing-the-needs-of-large-model-training-now-and-into-the-future-with-zero-infinity\">Addressing the needs of large model training now and into the future with ZeRO-Infinity<\/h2>\n\n\n\n<p>In the last three years, the largest trained dense model has grown over 1,000x, from a hundred million parameters in the pre-BERT era to over a hundred billion parameters now. However, in the same duration, single GPU memory has only increased by 5x (16 GB to 80 GB). Therefore, the growth in model size has been made possible mainly through advances in system technology for training large DL models, with parallel technologies such as model parallelism, pipeline parallelism, and ZeRO allowing large models to fit in aggregate GPU memory, creating a path to training larger and more powerful models.<\/p>\n\n\n\n<p>The state-of-the-art in large model training technology is 3D parallelism. It combines model parallelism (tensor slicing) and pipeline parallelism with data parallelism in complex ways to efficiently scale models by fully leveraging the aggregate GPU memory and compute of a cluster. 3D parallelism has been used in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/#toc-heading-0\">DeepSpeed<\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/NVIDIA\/Megatron-LM\">NVIDIA Megatron-LM<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, among other frameworks.<\/p>\n\n\n\n<p>Despite the incredible capabilities of 3D parallelism for large model training, we are now arriving at the GPU memory wall. The aggregate GPU memory is simply not large enough to support the growth in model size. 
Even with the newest NVIDIA A100 GPUs, which have 80 GB of memory, 3D parallelism requires 320 GPUs just to fit a trillion-parameter model for training. Furthermore, 3D parallelism requires significant code refactoring from data scientists, creating a large barrier to entry. Three questions arise:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Looking ahead, how do we <strong><em>support the next 1,000x growth in model size,<\/em><\/strong> going from models like GPT-3 with 175 billion parameters to models with hundreds of trillions of parameters?<\/li><li>Focusing on the present, how can we make the <strong><em>large models of today accessible to more data scientists<\/em><\/strong> who may not have access to the hundreds of GPUs currently required to fit these models?<\/li><li>Can we <strong><em>make large model training easier<\/em><\/strong> by eliminating the need for model refactoring?<\/li><\/ul>\n\n\n\n<p>Today, we take a leap forward from 3D parallelism by introducing ZeRO-Infinity, a novel system capable of addressing all the above-mentioned challenges of large model training. ZeRO-Infinity extends the ZeRO family of technology with new innovations in data mapping and high-performance heterogeneous memory access that allow ZeRO-Infinity to support massive model sizes on limited GPU resources by exploiting CPU and NVMe memory simultaneously, unencumbered by their limited bandwidth.<\/p>\n\n\n\n<p>ZeRO-Infinity can also train these models without the need to combine multiple forms of parallelism as 3D parallelism does. It does so via a novel memory-centric computation-tiling approach aimed at reducing the GPU memory requirements of large individual layers that would otherwise require model parallelism (tensor slicing) to fit the model in GPU memory. 
In addition, ZeRO-Infinity makes large model training easy by identifying and automating all the communication required for training any arbitrary model architecture, virtually eliminating the need for any model refactoring even when scaling to trillions of parameters. Last but not least, ZeRO-Infinity offers a powerful compute-and-communication-overlap engine designed to push training efficiency to the limits by hiding as much communication latency as possible.<\/p>\n\n\n\n<p>With all these innovations, ZeRO-Infinity redefines the capabilities of a DL system, offering <strong>unprecedented model scale<\/strong> that is <strong>accessible<\/strong> and <strong>easy to use<\/strong> while achieving <strong>excellent training efficiency<\/strong>.<\/p>\n\n\n\n<h2 id=\"unprecedented-model-scale-train-30-trillion-parameter-models-on-512-gpus\">Unprecedented model scale: Train 30-trillion-parameter models on 512 GPUs<\/h2>\n\n\n\n<p>ZeRO-Infinity offers a leap of orders of magnitude in DL training system technology, opening a path to supporting the next <strong>1,000x<\/strong> increase in model scale by efficiently exploiting the heterogeneous memory systems on current and future generations of hardware. It runs a model with <strong>over a trillion parameters on a single NVIDIA DGX-2 node<\/strong> and <strong>over 30 trillion parameters on 32 nodes (512 GPUs).<\/strong> With a hundred DGX-2 nodes in a cluster, we project ZeRO-Infinity can train models with over <strong>a hundred trillion parameters<\/strong> (see Figure 1 for details).<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig1_DS_UpdatedHighRes.jpg\" alt=\"Figure 1: Comparing model scale between 3D parallelism and ZeRO-Infinity. 
Experiments are performed on GPU clusters using NVIDIA DGX-2 16-GPU systems (nodes). The model scales up to 32 trillion parameters on 512 V100 GPUs (32 DGX-2 nodes) based on measured runs, while the number of parameters on 64 and 128 DGX-2 nodes is based on projections.\" width=\"814\" height=\"427\"\/><figcaption>Figure 1: Comparing model scale between 3D parallelism and ZeRO-Infinity. Experiments are performed on GPU clusters using NVIDIA DGX-2 16-GPU systems (nodes). The model scales up to 32 trillion parameters on 512 V100 GPUs (32 DGX-2 nodes) based on measured runs, while the number of parameters on 64 and 128 DGX-2 nodes is based on projections.<\/figcaption><\/figure><\/div>\n\n\n\n<p>To enable model training at this scale, ZeRO-Infinity extends the ZeRO family of technology with distinct innovations targeting different memory bottlenecks.<\/p>\n\n\n\n<p><em>1. Stage 3 of ZeRO (ZeRO-3)<\/em> allows for removing all memory redundancies in data-parallel training by partitioning model states across data-parallel processes.<\/p>\n\n\n\n\n\n<p>This first piece in&nbsp;ZeRO-Infinity represents the ultimate set of memory optimizations&nbsp;in the original&nbsp;<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-memory-optimizations-toward-training-trillion-parameter-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">ZeRO<\/a>&nbsp;paper.<\/p>\n\n\n\n<p>ZeRO is a family of memory optimization technologies for large-scale distributed deep learning. Unlike data parallelism (which is efficient but can only support a limited model size), model parallelism\/tensor slicing (which can support larger model sizes but adds communication overhead that limits efficiency), or pipeline parallelism (which can be efficient but requires significant model code refactoring), ZeRO allows fitting larger models in memory without requiring code refactoring while remaining very efficient. 
ZeRO does so by eliminating the memory redundancy that is inherent in data parallelism while limiting the communication overhead to a minimum.<\/p>\n\n\n\n<p>ZeRO&nbsp;removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig2_DS_HighRes.jpg\" alt=\"Figure 2: Memory savings and communication volume for the three stages of ZeRO compared with standard data parallel baseline. As a specific example, we show the memory consumption for a 7.5B parameter model training in mixed precision using Adam optimizer on 64 GPUs.\" width=\"821\" height=\"427\"\/><figcaption>Figure 2: Memory savings and communication volume for the three stages of ZeRO compared with standard data parallel baseline. 
As a specific example, we show the memory consumption for a 7.5B parameter model training in mixed precision using Adam optimizer on 64 GPUs.<\/figcaption><\/figure><\/div>\n\n\n\n<p>There are three stages in&nbsp;ZeRO&nbsp;corresponding to three model states, as shown in Figure&nbsp;2: the first stage (ZeRO-1) partitions only the optimizer states, the second stage (ZeRO-2) partitions both the optimizer states and the gradients, and the final stage (ZeRO-3) partitions all three model states (for more details see the&nbsp;<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-memory-optimizations-toward-training-trillion-parameter-models\/\">ZeRO&nbsp;paper<\/a>).<\/p>\n\n\n\n<p>During the training, ZeRO-3 ensures that the parameters required for the forward or backward pass of an operator are available right before its execution by issuing communication collective operations, such as broadcast or all-gather. After the execution of the operator, ZeRO-3 also removes the parameters as they are no longer needed until the next forward or backward pass of the operator. Additionally, during the parameter update phase of training, ZeRO-3 ensures that each data-parallel process only updates the optimizer states corresponding to the parameters that it owns. Therefore, ZeRO-3 can keep all the model states partitioned throughout the training except for the parameters that are required by the immediate computation. By leveraging ZeRO-3, ZeRO-Infinity can exploit the aggregate GPU memory available on a cluster to fit the model states. ZeRO-3 alone supports a trillion parameters with 1,024 NVIDIA V100 GPUs.<\/p>\n\n\n\n\n\n<p><em>2. 
Infinity Offload Engine, <\/em>a novel data offloading library, allows for fully exploiting modern heterogeneous memory architectures by offloading partitioned model states to CPU or NVMe device memory, which are much bigger than GPU memory.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig3_DS_HighRes.jpg\" alt=\"Figure 3: Breakdown of the total memory\/storage available on a single NVIDIA DGX-2 system. It has 3x CPU memory and over 50x NVMe storage compared to GPU memory.\" width=\"814\" height=\"198\"\/><figcaption>Figure 3: Breakdown of the total memory\/storage available on a single NVIDIA DGX-2 system. It has 3x CPU memory and over 50x NVMe storage compared to GPU memory.<\/figcaption><\/figure><\/div>\n\n\n\n\n\n<p><strong>Infinity Offload Engine: <\/strong>State-of-the-art DL training systems, such as 3D parallelism, are bottlenecked by the aggregate GPU memory. However, modern GPU clusters have 2\u20133x more total CPU memory than total GPU memory, and a whopping 50x more total NVMe memory (see Figure 3 for details). Furthermore, the existing NVMe technology allows for over 25 GB\/sec of achievable read\/write speeds per DGX-2 node, comparable to PCIe 4.0 links. While this is nowhere close to the peak GPU memory bandwidth (1 TB\/sec), we realized that we can fully exploit these memories to achieve extreme model scales, without being bottlenecked by their bandwidth, through careful design and optimizations.<\/p>\n\n\n\n<p>The new Infinity Offload Engine is the brainchild of this realization. It is a novel data transfer library consisting of carefully crafted high-performance data movement kernels for training DL models. 
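<\/p>\n\n\n\n<p>As a rough illustration of how this offload capability is exposed to users, the snippet below builds a DeepSpeed-style configuration that enables ZeRO-3 with parameter and optimizer-state offload to NVMe. This is a hedged sketch: the key names follow the zero_optimization schema in the DeepSpeed documentation, but exact options can vary across versions, and the NVMe path is a placeholder.<\/p>

```python
# Sketch of a ZeRO-Infinity style DeepSpeed configuration (illustrative).
# Key names follow DeepSpeed's documented zero_optimization schema;
# "/local_nvme" is a placeholder path, and options may differ by version.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                 # ZeRO-3: partition all three model states
        "offload_param": {          # offload partitioned parameters to NVMe
            "device": "nvme",
            "nvme_path": "/local_nvme",
        },
        "offload_optimizer": {      # offload optimizer states to NVMe
            "device": "nvme",
            "nvme_path": "/local_nvme",
        },
    },
}
```

<p>Setting the offload devices to \u201ccpu\u201d instead of \u201cnvme\u201d targets CPU memory with the same schema; such a configuration is passed to the DeepSpeed engine at initialization.<\/p>\n\n\n\n<p>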
It can fully exploit both the CPU memory and the NVMe memory to offload the model states partitioned by ZeRO-3, allowing ZeRO-Infinity to create and train models at an unprecedented scale. ZeRO-Infinity can train a trillion-parameter model on a single GPU within a DGX-2 node (100x larger than the current state of the art), or it can train models with over 30 trillion parameters on 32 such nodes (50x larger than 3D parallelism on the same number of nodes). Refer back to Figure 1 for details.<\/p>\n\n\n\n\n\n<p><em>3. Activation checkpointing with CPU offload<\/em> allows for reducing the activation memory footprint, which can become the memory bottleneck on the GPU after the memory required by the model states is addressed by ZeRO-3 and the Infinity Offload Engine.<\/p>\n\n\n\n\n\n<p><strong>Activation checkpointing with CPU offload: <\/strong>Models with tens of billions of parameters require a significant amount of memory for storing activations, beyond what is available on a single GPU. To avoid running out of memory, we can use activation checkpointing, where instead of storing all activations, we only store them at specified intervals, saving memory at the expense of activation re-computation in the backward pass. Activation checkpointing can reduce the activation memory footprint by orders of magnitude. However, for massive models, the memory requirement after activation checkpointing can still be too large to fit in GPU memory. To address this, we support activation checkpointing with CPU offload, allowing all the activations to reside in the CPU memory.<\/p>\n\n\n\n\n\n<p>4. 
<em>Memory-centric operator tiling, <\/em>a novel computation rescheduling technique that works together with the ZeRO data access and communication schedule, allows for reducing the memory footprint of incredibly massive individual layers that can be too large to fit in GPU memory even one layer at a time.<\/p>\n\n\n\n\n\n<p><strong>Memory-centric operator tiling: <\/strong>Models with hundreds of billions to trillions of parameters require significant memory even for individual layers. As an example, a single intermediate layer in a Transformer model with a hidden dimension of 64K requires over 64 GB of memory to store the parameters and gradients in fp16. Computing the forward and backward passes on such a layer requires not only over 64 GB of working memory, but also at least two contiguous memory buffers of over 32 GB, one for the parameters and another for the gradients. This is very difficult\u2014even on NVIDIA A100 80 GB GPU cards\u2014due to the presence of memory fragmentation.<\/p>\n\n\n\n<p>Model parallelism solves this issue by partitioning the individual layer across GPUs. Alternatively, it is possible to partition this layer in the same way as with model parallelism but execute these partitions in sequence on the same GPU. We call this approach <em>memory-centric operator tiling<\/em>. 
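<\/p>\n\n\n\n<p>The sequential-tile idea can be made concrete with a small sketch (a toy NumPy analogue, not the actual DeepSpeed kernels; all names here are illustrative): a large linear layer is computed one output tile at a time, so only a slice of the weight matrix needs to be live in working memory at any moment.<\/p>

```python
import numpy as np

def tiled_linear(x, W, num_tiles):
    """Compute y = x @ W one output-column tile at a time (illustrative)."""
    out_dim = W.shape[1]
    assert out_dim % num_tiles == 0
    tile = out_dim // num_tiles
    outputs = []
    for i in range(num_tiles):
        # In ZeRO-Infinity, each weight slice would be gathered on demand
        # and released after use; here we simply slice a resident array.
        W_i = W[:, i * tile:(i + 1) * tile]
        outputs.append(x @ W_i)
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 16))
assert np.allclose(tiled_linear(x, W, num_tiles=4), x @ W)
```

<p>The tiled result is numerically identical to the untiled product, while the peak working memory for the weights shrinks roughly in proportion to the number of tiles.<\/p>\n\n\n\n<p>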
When combined with ZeRO-3, which can gather and remove parameters on demand, this tiling reduces the working memory proportional to the number of partitions, supporting arbitrarily large hidden sizes that would otherwise require model parallelism to fit even a single model layer.<\/p>\n\n\n\n\n\n<h2 id=\"broader-access-to-fine-tuning-extremely-large-models-gpt-3-or-even-larger-models-on-a-single-gpu\">Broader access to fine-tuning extremely large models: GPT-3 or even larger models on a single GPU<\/h2>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/FIg4_DS_HighRes.jpg\" alt=\"Figure 4: Comparing the largest model sizes that can be trained on a single NVIDIA DGX-2 node using various parallel DL training technologies. The NVIDIA DGX-2 node consists of 16 V100-32GB GPUs along with 1.5 TB of CPU memory and 20 TB of usable NVMe storage. The blue, orange, and green colors are used to represent technologies that use GPU memory only, GPU with CPU memory, and GPU with both CPU and NVMe memory, respectively. ZeRO-Infinity can in fact run with over a trillion parameters even on a single GPU compared to state of the art, which is 13 billion parameters with ZeRO Offload.\" width=\"828\" height=\"414\"\/><figcaption>Figure 4: Comparing the largest model sizes that can be trained on a single NVIDIA DGX-2 node using various parallel DL training technologies. The NVIDIA DGX-2 node consists of 16 V100-32GB GPUs along with 1.5 TB of CPU memory and 20 TB of usable NVMe storage. The blue, orange, and green colors are used to represent technologies that use GPU memory only, GPU with CPU memory, and GPU with both CPU and NVMe memory, respectively. 
ZeRO-Infinity can in fact run with over a trillion parameters even on a single GPU compared to state of the art, which is 13 billion parameters with ZeRO Offload.<\/figcaption><\/figure><\/div>\n\n\n\n<p>While pretraining is the first important step in creating a massive model, fine-tuning for specific tasks is essential to leveraging the full potential of the model for different scenarios. Making fine-tuning of massive models easily accessible to data scientists could allow the creation of many derived models to meet the needs of various application scenarios. These tasks might range from grammar correction to writing assistance, from image captioning to code generation\u2014any task possible with large AI models.<\/p>\n\n\n\n<p>Unlike pretraining, which can require millions of GPU compute hours, fine-tuning a model with hundreds of billions of parameters is much cheaper, requiring significantly fewer GPU compute hours, and can be done on a single compute node with a handful of GPUs. While such compute resources are accessible to many businesses and users, they are unfortunately restricted by the memory available on these compute nodes, which in turn limits the size of the model that can be fine-tuned. This makes large model fine-tuning inaccessible to most businesses and users that do not have access to massive GPU clusters.<\/p>\n\n\n\n<p>ZeRO-Infinity completely changes this landscape by enabling data scientists with access to a single node, such as the NVIDIA DGX-2, to fine-tune models with over a trillion parameters (Figure 4). In fact, it can run models with over a trillion parameters even on a single GPU of such a node since it has enough CPU and NVMe memory. This is nearly 100x larger than the state of the art for single GPU training. With ZeRO-Infinity, the memory bottleneck is no longer the GPU memory or even the CPU memory. 
Instead, we can now leverage them together with the much larger and cheaper NVMe memory.<\/p>\n\n\n\n<p>Through ZeRO-Infinity, we take another step toward democratization of AI by enabling users and businesses with limited resources to leverage the power of massive models for their business-specific applications.<\/p>\n\n\n\n<h2 id=\"train-massive-models-without-any-code-refactoring\">Train massive models without any code refactoring<\/h2>\n\n\n\n<p>Scaling models to hundreds of billions\u202fand trillions\u202fof parameters is challenging. Data parallelism cannot scale a model\u2019s size much further beyond a billion parameters. Model parallelism with tensor slicing is challenging to efficiently scale beyond a single node due to communication overheads. Finally, pipeline parallelism cannot scale beyond the number of layers available in a model, which limits both the model size and the scale of GPUs.\u202f<\/p>\n\n\n\n<p>The only existing parallel technology available that can scale to over a trillion parameters on massively parallel GPU clusters is 3D parallelism, which combines data, model, and pipeline parallelism in complex ways. While such a system can be very efficient, it requires data scientists to do major model code refactoring, splitting the model into load-balanced pipeline stages. 
This also makes 3D parallelism inflexible in the types of models it can support, since models with complex dependency graphs cannot be easily converted into a load-balanced pipeline.&nbsp;<\/p>\n\n\n\n<p>ZeRO-Infinity addresses these challenges in two ways.&nbsp;First,&nbsp;with groundbreaking model scaling, ZeRO-Infinity is the only DL parallel technology that can&nbsp;<em>efficiently<\/em>&nbsp;scale to trillions of parameters without requiring a hybrid parallelism strategy, greatly simplifying the system stack for DL training.&nbsp;Second, ZeRO-Infinity requires virtually no model refactoring from data scientists, freeing them to scale up complex models from hundreds of billions to hundreds of trillions of parameters as the compute becomes available.<\/p>\n\n\n\n<h2 id=\"excellent-training-efficiency-and-superlinear-scalability\">Excellent training efficiency&nbsp;and superlinear scalability<\/h2>\n\n\n\n<p>ZeRO-Infinity can offload model states and activations to NVMe and CPU, which have orders-of-magnitude slower communication bandwidth (10\u201325 GB\/sec) than GPU memory bandwidth (about 900 GB\/sec). Furthermore, it inherits the 50 percent additional GPU-to-GPU communication overhead that ZeRO-3 incurs compared to standard data-parallel training. Despite these limitations, ZeRO-Infinity can achieve excellent training efficiency that is comparable to state-of-the-art GPU-only solutions like 3D parallelism, and it is significantly better than standard data-parallel training with PyTorch.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"538\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig5_DeepSpeed-UpdatedFinal.jpg\" alt=\"Figure 5: ZeRO-Infinity obtains excellent training efficiency\u2014over 25 petaflops of sustained performance for multi-billion and multi-trillion parameter models on 512 NVIDIA V100 GPUs. 
The efficiency at model sizes of 500B is comparable to state-of-the-art 3D parallelism. Unlike ZeRO-Infinity, 3D parallelism cannot scale to models with trillions of parameters due to GPU memory constraint.\" class=\"wp-image-741031\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig5_DeepSpeed-UpdatedFinal.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig5_DeepSpeed-UpdatedFinal-300x168.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig5_DeepSpeed-UpdatedFinal-768x430.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig5_DeepSpeed-UpdatedFinal-16x9.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig5_DeepSpeed-UpdatedFinal-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig5_DeepSpeed-UpdatedFinal-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig5_DeepSpeed-UpdatedFinal-640x360.jpg 640w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/><figcaption>Figure 5: ZeRO-Infinity obtains excellent training efficiency\u2014over 25 petaflops of sustained performance for multi-billion and multi-trillion parameter models on 512 NVIDIA V100 GPUs. The efficiency at model sizes of 500B is comparable to state-of-the-art 3D parallelism. Unlike ZeRO-Infinity, 3D parallelism cannot scale to models with trillions of parameters due to GPU memory constraint.<\/figcaption><\/figure>\n\n\n\n<p>As a concrete example, ZeRO-Infinity achieves a sustained throughput of 37\u201350 teraflops\/GPU for model sizes ranging from 400 billion parameters to 20 trillion parameters running on 512 NVIDIA V100 GPUs (see Figure 5). In comparison, 3D parallelism achieves very similar throughput (48 teraflops\/GPU) for a 650-billion-parameter model, the largest model that can be trained on the same number of GPUs. 
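<\/p>\n\n\n\n<p>The aggregate number quoted above follows directly from the per-GPU measurements; a quick check of the arithmetic:<\/p>

```python
# Aggregate sustained throughput implied by the per-GPU figures above.
gpus = 512
per_gpu_tflops = 50               # upper end of the 37-50 teraflops/GPU range
aggregate_pflops = gpus * per_gpu_tflops / 1000
print(aggregate_pflops)           # 25.6, i.e., over 25 petaflops (Figure 5)
```

<p>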
Standard data-parallel training with PyTorch only achieves 30 teraflops per GPU for a 1.3-billion-parameter model, the largest model that can be trained using data parallelism alone.<\/p>\n\n\n\n<p>There are three key innovations behind the excellent training efficiency of ZeRO-Infinity:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><em>Bandwidth-centric partitioning enables parallel memory access resulting in virtually unlimited heterogeneous memory bandwidth. <\/em>With ZeRO-Infinity, the effective NVMe and CPU memory bandwidth grows linearly with the number of available devices. For instance, the NVMe bandwidth is about 25 GB\/sec per DGX-2 node, but on a cluster with 64 such nodes, this increases to 1.6 TB\/sec, even faster than the GPU HBM2 memory on the NVIDIA V100 GPU that can achieve 0.9 TB\/sec.<\/li><\/ol>\n\n\n\n\n\n<p><strong>Bandwidth-centric partitioning:<\/strong> In the original ZeRO, the parameters for each layer are owned by a unique data-parallel process, which must broadcast them to all other ranks when they are needed. If these parameters are located in CPU memory, they must first be copied to the GPU before the broadcast operation, so the copy bandwidth is limited by a single PCIe link. In contrast, in ZeRO-Infinity, the parameters for each layer are partitioned across all data-parallel processes, which use an all-gather operation instead of a broadcast when the parameters are needed. If the parameters are located in GPU memory, this makes no difference\u2014both broadcast and all-gather have the same communication cost. But if they are located in CPU memory, it makes a significant difference: each data-parallel process transfers only its own partition of the parameters to the GPU, in parallel, before the all-gather is performed. ZeRO-Infinity can therefore leverage the aggregate bandwidth across all PCIe links instead of being bottlenecked by a single one.<\/p>\n\n\n\n\n\n<p><em>2.
Communication-overlap-centric design and implementation <\/em>allows ZeRO-Infinity to hide nearly all of the communication volume at a reasonable batch size. ZeRO-Infinity can effectively overlap NVMe read\/write, CPU-GPU data transfers, GPU-GPU communication, and GPU computation all at once.<\/p>\n\n\n\n\n\n<p><strong>Overlap-centric design: <\/strong>With the option of offloading model states to CPU and NVMe, overlapping communication is challenging. Before partitioned parameters can be reconstructed on the GPU using all-gather, they must first be brought from NVMe to CPU and then from CPU to the GPU. Retrieving a parameter during training therefore requires a three-step communication process, which can severely limit training efficiency.<\/p>\n\n\n\n<p>To hide the cost of this communication, ZeRO-Infinity implements a dynamic prefetcher that traces the forward and backward computation on the fly, constructing an internal map of the operator sequence for each iteration. During each iteration, the prefetcher tracks where it is in the operator sequence and can prefetch the parameters required by future operators. The prefetcher is aware of the three-step communication process and can therefore overlap the NVMe-CPU transfer for the parameters of one layer with the CPU-GPU transfer and GPU-GPU all-gather for the parameters of other layers, effectively overlapping all three communication stages with compute.<\/p>\n\n\n\n\n\n<p><em>3. DeepNVMe module<\/em>, created by the DeepSpeed team, allows for asynchronously reading and writing tensors to NVMe storage at near-peak NVMe bandwidth in PyTorch.<\/p>\n\n\n\n\n\n<p><strong>DeepNVMe Module: <\/strong>NVMe is a storage interface designed to fully utilize the maximum I\/O performance of modern solid-state drive (SSD) devices.
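<\/p>\n\n\n\n<p>The three-stage parameter fetch described above (NVMe to CPU, CPU to GPU, then GPU-GPU all-gather) can be illustrated with a minimal Python sketch. The stage functions below are illustrative stand-ins rather than the DeepSpeed implementation; the sketch only shows how stage-1 reads for upcoming layers can proceed while later stages run for earlier layers:<\/p>\n\n\n\n

```python
# Minimal sketch of overlapping the three transfer stages across layers.
# nvme_to_cpu, cpu_to_gpu, and all_gather are stand-in stage functions.
from concurrent.futures import ThreadPoolExecutor

def nvme_to_cpu(layer):
    return ('cpu', layer)          # stage 1: NVMe -> CPU copy

def cpu_to_gpu(item):
    return ('gpu', item[1])        # stage 2: CPU -> GPU copy

def all_gather(item):
    return ('gathered', item[1])   # stage 3: GPU-GPU all-gather

def prefetch_pipeline(layers):
    # Submit stage-1 reads for all upcoming layers up front, so stage-1 I/O
    # for layer i+1 proceeds while stages 2 and 3 run for layer i.
    results = []
    with ThreadPoolExecutor(max_workers=3) as pool:
        stage1 = [pool.submit(nvme_to_cpu, layer) for layer in layers]
        for fut in stage1:
            staged = pool.submit(cpu_to_gpu, fut.result())
            results.append(all_gather(staged.result()))
    return results
```

<p>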
However, despite the popularity of SSDs, it remains difficult for most applications to enjoy the full benefits of SSDs due to the lack of user-level libraries that conveniently and efficiently provide NVMe functionality. An effective user-level NVMe module must address at least two key challenges: 1) provide a convenient interface to enable integration without major redesign of application logic and 2) furnish an efficient implementation that imposes negligible overhead on request processing. DeepNVMe was developed to fill this gap as a user-level library that enables applications to easily exploit maximum SSD performance. DeepNVMe addresses the interface and performance challenges of user-level NVMe libraries as follows.<\/p>\n\n\n\n<p>From an interface perspective, DeepNVMe allows applications to flexibly submit both blocking and non-blocking I\/O requests (reads or writes), and it allows synchronization requests to flush pending read or write requests. This flexible interface enables easy integration of DeepNVMe into existing application logic rather than requiring the application to be restructured to adapt to DeepNVMe.<\/p>\n\n\n\n<p>From a performance perspective, DeepNVMe allows applications to leverage both intra-request parallelism (I\/O requests from a single user thread) and inter-request parallelism (I\/O requests from multiple user threads). DeepNVMe efficiently supports these different forms of request parallelism through a number of optimizations, including low-overhead multi-threading, smart work scheduling, avoidance of data copying, and memory pinning. ZeRO-Infinity uses DeepNVMe to copy tensors to and from the local SSD devices in order to make space in GPU and CPU memory for training multi-trillion-parameter models.
The benefit of DeepNVMe is clearly demonstrated by the ability to effectively support a massive, data-intensive, and performance-critical application like ZeRO-Infinity.<\/p>\n\n\n\n\n\n<p>In addition to achieving high training efficiency, ZeRO-Infinity preserves superlinear scalability (see Figure 6) that we have demonstrated with all our previous ZeRO technologies (ZeRO-1, ZeRO-2, and ZeRO-Offload). This is possible because of the memory-and-compute access pattern of ZeRO-Infinity\u2014it reduces the NVMe\/CPU communication time as well as the optimizer update time linearly with the increasing number of GPUs and nodes, respectively.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/Fig6_DS_HighRes.jpg\" alt=\"Figure 6: ZeRO-Infinity achieves superlinear scalability with an increase in GPU count by leveraging the aggregate PCIe, CPU-memory, and NVMe-memory bandwidth, which also increases with the GPU count. Furthermore, it also leverages the aggregate CPU compute that increases linearly with the number of compute nodes, further supporting superlinear scaling.\" width=\"840\" height=\"489\"\/><figcaption>Figure 6: ZeRO-Infinity achieves superlinear scalability with an increase in GPU count by leveraging the aggregate PCIe, CPU-memory, and NVMe-memory bandwidth, which also increases with the GPU count. Furthermore, it also leverages the aggregate CPU compute that increases linearly with the number of compute nodes, further supporting superlinear scaling.<\/figcaption><\/figure><\/div>\n\n\n\n<h2 id=\"zero-infinity-redefines-the-large-model-training-landscape\">ZeRO-Infinity redefines the large model training landscape<\/h2>\n\n\n\n<p>It was less than a year ago that 3D parallelism enabled training of a model at a scale of a trillion parameters with 800 NVIDIA V100 GPUs. 
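<\/p>\n\n\n\n<p>For perspective on the memory involved: mixed-precision Adam training keeps roughly 16 bytes of model states per parameter (fp16 parameters and gradients plus fp32 optimizer states), so a trillion-parameter model needs about 16 TB for model states alone. The back-of-envelope sketch below compares that footprint with approximate, assumed capacities of a single DGX-2 node:<\/p>\n\n\n\n

```python
# Rough model-state memory arithmetic for mixed-precision Adam training.
# All device capacities below are approximate assumptions for a DGX-2 node.
params = 1.0e12                      # one trillion parameters
bytes_per_param = 2 + 2 + 12         # fp16 param + fp16 grad + fp32 optimizer states
model_states = params * bytes_per_param          # 16e12 bytes = 16 TB

gpu_hbm = 16 * 32.0e9                # ~512 GB of HBM2 across 16 V100 GPUs
cpu_mem = 1.5e12                     # ~1.5 TB of CPU memory (assumed)
nvme_mem = 28.0e12                   # tens of TB of local NVMe (assumed)

print(model_states > gpu_hbm)                       # far exceeds GPU memory alone
print(model_states < gpu_hbm + cpu_mem + nvme_mem)  # fits in aggregate node memory
```

<p>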
Now, with ZeRO-Infinity, the same scale can be achieved on a single DGX-2 node (16 V100 GPUs) with virtually no model refactoring. Massive model training is no longer just a possibility for companies with access to massive supercomputers and heavy system expertise. Instead, it\u2019s now within reach of data scientists with only a single GPU or a few GPUs.<\/p>\n\n\n\n<p>In addition, ZeRO-Infinity offers a paradigm shift in how we think about memory for large model training. It is no longer necessary to fit DL training on ultra-fast yet expensive memory of limited size, like HBM2. ZeRO-Infinity demonstrates that it is possible to transcend the GPU memory wall by leveraging cheap and slow, but massive, CPU or NVMe memory in parallel across multiple devices to achieve the aggregate bandwidth necessary for efficient training.<\/p>\n\n\n\n<p>With memory no longer a limitation on model scale or efficiency, it is now critical that we focus on innovations in compute performance and GPU-to-GPU bandwidth. While it is now possible to fit a 30-trillion-parameter model for training on 512 NVIDIA V100 GPUs with ZeRO-Infinity, completing the end-to-end pre-training in a reasonable time remains very challenging. It could demand 100x improvements in compute performance and in the interconnect bandwidth between GPUs compared to what is available on current NVIDIA DGX V100 clusters. The state-of-the-art NVIDIA A100 GPUs and DGX A100 nodes are good steps in that direction, offering 3x\u20136x the compute performance and 2x the interconnect bandwidth per GPU compared to NVIDIA DGX V100 nodes.
We welcome such improvements, and are excited that the NVIDIA A100 GPU will soon be available through <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/techcommunity.microsoft.com\/t5\/azure-compute\/azure-announces-public-availability-of-nd-a100-v4-ai\/ba-p\/1892832\">Azure ND A100 v4 VMs<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p>Finally, we hope that with memory no longer a limitation, ZeRO-Infinity will further inspire an acceleration in the compute- and network-bandwidth-focused design of the future ultra-powerful devices and supercomputing clusters necessary for the next 1000x growth in model scale and the quality improvements it will offer.<\/p>\n\n\n\n<p>Please read&nbsp;our&nbsp;<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-infinity-breaking-the-gpu-memory-wall-for-extreme-scale-deep-learning\/\" target=\"_blank\" rel=\"noreferrer noopener\">ZeRO-Infinity paper<\/a>&nbsp;for more details and visit the&nbsp;<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/\" target=\"_blank\" rel=\"noreferrer noopener\">DeepSpeed website<\/a>&nbsp;and&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/DeepSpeed\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repository<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;for the code, tutorials, and documentation about these new technologies!<\/p>\n\n\n\n<p><strong>About DeepSpeed\u2019s integration with Azure Machine Learning and open-source solutions<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Azure Machine Learning:<\/strong> The DeepSpeed and Azure Machine Learning teams have made it simple for users to train DeepSpeed-powered models on Azure Machine Learning.
Specifically, the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/resource-curated-environments#deepspeed\">DeepSpeed curated environment<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> makes it simple for users to get started with DeepSpeed on Azure. Example DeepSpeed models are actively being added to the official <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Azure\/azureml-examples\">Azure Machine Learning examples repo<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.
Get started with our OpenAI <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Azure\/azureml-examples\/tree\/main\/workflows\/train\/deepspeed\/transformers\">GPT-2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Azure\/azureml-examples\/blob\/main\/workflows\/train\/deepspeed\/cifar\/job.py\">cifar<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> examples. Azure Machine Learning provides powerful GPU support to accelerate model development. <\/li><li><strong>Hugging Face:<\/strong> Hugging Face recently announced its <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/huggingface.co\/blog\/zero-deepspeed-fairscale\">integration with DeepSpeed<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which allows users to easily accelerate their models through a simple \u201c--deepspeed\u201d flag and config file.
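Such a config file is a JSON document; a hedged sketch of one enabling ZeRO stage 3 with CPU\/NVMe-style offload is shown below as a Python dict. The keys mirror the published DeepSpeed config schema, while the batch size and nvme_path values are purely illustrative:

```python
# Sketch of a DeepSpeed config enabling ZeRO stage 3 with NVMe offload.
# Keys follow DeepSpeed's config schema; values (and nvme_path) are
# illustrative placeholders, not recommendations.
ds_config = {
    'train_batch_size': 16,
    'fp16': {'enabled': True},
    'zero_optimization': {
        'stage': 3,
        'offload_optimizer': {'device': 'nvme', 'nvme_path': '/local_nvme'},
        'offload_param': {'device': 'nvme', 'nvme_path': '/local_nvme'},
    },
}
```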
Through this integration, DeepSpeed brings <strong>3x faster<\/strong> multi-GPU training compared with the original solution. DeepSpeed also allows fitting a significantly larger model for users who own just a single GPU (or a few GPUs), with much higher compute efficiency than alternatives.<\/li><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/PyTorchLightning\/pytorch-lightning\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>PyTorch Lightning<\/strong><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><strong>:&nbsp;<\/strong>We are happy to announce that&nbsp;PyTorch&nbsp;Lightning integrates&nbsp;DeepSpeed&nbsp;as a plugin for&nbsp;DL&nbsp;training optimizations:&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/medium.com\/pytorch-lightning\/accessible-multi-billion-parameter-model-training-with-pytorch-lightning-deepspeed-c9333ac3bb59\" target=\"_blank\" rel=\"noopener noreferrer\">Accessible Multi-Billion Parameter Model Training&nbsp;with PyTorch Lightning + DeepSpeed<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;To enable&nbsp;DeepSpeed&nbsp;in Lightning 1.2, it is as simple as passing plugins=&#8217;deepspeed&#8217; to the Lightning trainer (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/pytorch-lightning.readthedocs.io\/en\/1.2.0\/advanced\/multi_gpu.html?highlight=deepspeed#deepspeed\" target=\"_blank\" rel=\"noopener noreferrer\">docs<span
class=\"sr-only\"> (opens in new tab)<\/span><\/a>).&nbsp;<\/li><\/ul>\n\n\n\n<p><strong>About the DeepSpeed Team:<\/strong><\/p>\n\n\n\n<p>We are a group of system researchers and engineers\u2014Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Shaden Smith, Elton Zheng, Reza Yazdani Aminabadi, Arash Ashari, Ammar Ahmad Awan, Cheng Li, Conglong Li, Niranjan Uma Naresh, Minjia Zhang, Jeffrey Zhu, Yuxiong He (team lead)\u2014who are enthusiastic about performance optimization of large-scale systems. We have recently focused on deep learning systems, optimizing deep learning\u2019s speed to train, speed to convergence, and speed to develop!<\/p>\n\n\n\n<p>If this type of work interests you, the DeepSpeed team is hiring both researchers and engineers! Please visit our&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/careers.microsoft.com\/us\/en\/search-results?keywords=deepspeed%20open-source\">careers page<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Since the DeepSpeed optimization library was introduced last year, it has rolled out numerous novel optimizations for training large AI models\u2014improving scale, speed, cost, and usability. As large models have quickly evolved over the last year, so too has DeepSpeed. 
Whether enabling researchers to create the 17-billion-parameter Microsoft Turing Natural Language Generation (Turing-NLG) with state-of-the-art [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":741007,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-738250","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[649749],"related-events":[],"related-researchers":[{"type":"guest","value":"deepspeed-team","user_id":"690909","display_name":"DeepSpeed Team","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/#!people\" aria-label=\"Visit the profile page for DeepSpeed Team\">DeepSpeed Team<\/a>","is_active":true,"last_first":"Team, DeepSpeed","people_section":0,"alias":"deepspeed-team"},{"type":"user_nicename","value":"Rangan Majumder","user_id":38931,"display_name":"Rangan Majumder","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ranganm\/\" aria-label=\"Visit the profile page for Rangan Majumder\">Rangan Majumder<\/a>","is_active":false,"last_first":"Majumder, 
Rangan","people_section":0,"alias":"ranganm"},{"type":"guest","value":"andrey-proskurin","user_id":"740575","display_name":"Andrey  Proskurin","author_link":"<a href=\"https:\/\/ca.linkedin.com\/in\/andreyproskurin\" aria-label=\"Visit the profile page for Andrey  Proskurin\">Andrey  Proskurin<\/a>","is_active":true,"last_first":"Proskurin, Andrey ","people_section":0,"alias":"andrey-proskurin"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-960x540.jpg\" class=\"img-object-cover\" alt=\"DeepSpeed figure\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-16x9.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1066x600.jpg 1066w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/#!people\" title=\"Go to researcher profile for DeepSpeed Team\" aria-label=\"Go to researcher profile for DeepSpeed Team\" data-bi-type=\"byline author\" data-bi-cN=\"DeepSpeed Team\">DeepSpeed Team<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ranganm\/\" title=\"Go to researcher profile for Rangan Majumder\" aria-label=\"Go to researcher profile for Rangan Majumder\" data-bi-type=\"byline author\" data-bi-cN=\"Rangan Majumder\">Rangan Majumder<\/a>, and <a href=\"https:\/\/ca.linkedin.com\/in\/andreyproskurin\" title=\"Go to researcher profile for Andrey  Proskurin\" aria-label=\"Go to researcher profile for Andrey  Proskurin\" data-bi-type=\"byline author\" data-bi-cN=\"Andrey  Proskurin\">Andrey  Proskurin<\/a>","formattedDate":"April 19, 2021","formattedExcerpt":"Since the DeepSpeed optimization library was introduced last year, it has rolled out numerous novel optimizations for training large AI models\u2014improving scale, speed, cost, and usability. As large models have quickly evolved over the last year, so too has DeepSpeed. 
Whether enabling researchers to create&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/738250","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=738250"}],"version-history":[{"count":61,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/738250\/revisions"}],"predecessor-version":[{"id":845773,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/738250\/revisions\/845773"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/741007"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=738250"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=738250"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=738250"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=738250"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=738250"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=738250"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/
en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=738250"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=738250"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=738250"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=738250"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=738250"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}