DeepSpeed

Extreme Speed and Scale for DL Training and Inference

  1. 2023
    Feb

    DeepSpeed supports automatic tensor parallelism for HuggingFace models

    Previously, a user needed to provide an injection policy to DeepSpeed to enable tensor parallelism. DeepSpeed now supports automatic tensor parallelism for HuggingFace models by default, as long as kernel injection is not enabled and an injection policy is not provided. This lets users improve the performance of models that are not currently supported via kernel injection, without writing an injection policy. See our tutorial on the new automatic tensor parallelism feature for inference.
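
    A minimal sketch of the workflow, assuming a two-GPU run through the deepspeed launcher (deepspeed --num_gpus 2 infer.py); the model name here is an illustrative choice, not a requirement:

        import os
        import torch
        import deepspeed
        from transformers import AutoModelForCausalLM, AutoTokenizer

        local_rank = int(os.getenv("LOCAL_RANK", "0"))
        world_size = int(os.getenv("WORLD_SIZE", "1"))

        model_name = "gpt2"  # any supported HuggingFace model
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)

        # With kernel injection disabled and no injection policy provided,
        # DeepSpeed shards the model with automatic tensor parallelism.
        engine = deepspeed.init_inference(
            model,
            mp_size=world_size,
            dtype=torch.float16,
            replace_with_kernel_inject=False,
        )

        inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(f"cuda:{local_rank}")
        outputs = engine.module.generate(**inputs, max_new_tokens=32)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))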

  2. 2022
    Dec

    DeepSpeed Data Efficiency Library: Towards Less Data, Faster Training, and Higher Model Quality

    DeepSpeed releases a new Data Efficiency Library that reduces training data and cost while boosting model quality, through new innovations in data sampling and data routing with composable and customizable library support. The library greatly reduces training cost while maintaining model quality (1.5-2x less data and time for GPT-3/BERT pretraining), or further improves model quality under the same training cost (>1 point gain in GPT-3-1.3B zero/few-shot evaluation).
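
    The sketch below shows the general shape of the DeepSpeed config that enables both techniques; it is a hedged outline, and the nested metric and schedule settings (elided here) follow the data efficiency tutorial:

        # Both techniques live under a single "data_efficiency" section of the
        # DeepSpeed config; the elided sub-keys are documented in the tutorial.
        ds_config = {
            "train_batch_size": 256,
            "data_efficiency": {
                "enabled": True,
                "seed": 1234,
                # data sampling: curriculum learning presents easier samples first
                "data_sampling": {
                    "enabled": True,
                    "curriculum_learning": {"enabled": True},  # + metric/schedule keys
                },
                # data routing: random layerwise token dropping (random-LTD)
                "data_routing": {
                    "enabled": True,
                    "random_ltd": {"enabled": True},  # + layer/schedule keys
                },
            },
        }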

  3. 2022
    Nov

    Achieve sub-second Stable Diffusion Image Generation with DeepSpeed-MII

    Stable Diffusion is a latent text-to-image diffusion model capable of creating stunning art within seconds. In this tutorial you will learn how to deploy and run Stable Diffusion with state-of-the-art performance optimizations from DeepSpeed-Inference and DeepSpeed-MII, and achieve image generation in under one second.
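
    A minimal sketch of such a deployment with the MII API of that release; the deployment name is illustrative, and gated checkpoints may additionally require a HuggingFace auth token in mii_config:

        import mii

        # Deploy Stable Diffusion behind a local inference endpoint.
        mii.deploy(task="text-to-image",
                   model="CompVis/stable-diffusion-v1-4",
                   deployment_name="sd_deploy",
                   mii_config={"dtype": "fp16"})

        # Query the deployment; the response carries the generated image(s).
        generator = mii.mii_query_handle("sd_deploy")
        result = generator.query({"query": ["a photo of an astronaut riding a horse"]})

        # Tear down the deployment when finished.
        mii.terminate("sd_deploy")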

  4. 2022
    Oct

    DeepSpeed Model Implementations for Inference (MII)

    DeepSpeed-MII, a new open-source Python library from DeepSpeed, is now available to make low-latency, low-cost inference of powerful deep learning models not only feasible but also easily accessible to everyone. MII offers highly optimized implementations of thousands of widely used deep learning models; for example, MII speeds up Stable Diffusion by 1.9x and the BigScience BLOOM 176B model by 5.7x with a 40x cost reduction.
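
    As an illustration, a text-generation deployment follows the same deploy-and-query pattern (the model and deployment names here are illustrative choices):

        import mii

        # Deploy a text-generation model behind a local inference endpoint.
        mii.deploy(task="text-generation",
                   model="bigscience/bloom-560m",
                   deployment_name="bloom_deploy")

        generator = mii.mii_query_handle("bloom_deploy")
        result = generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=64)
        print(result)

        # Tear down the deployment when finished.
        mii.terminate("bloom_deploy")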

  5. 2022
    Sep

    ZeRO-Inference: Democratizing massive model inference

    ZeRO-Inference comes from the family of ZeRO technologies, a collection of powerful memory and parallelism optimizations for efficient large-scale model training and inference on modern GPU clusters. ZeRO-Inference enables inference computation of massive models (with hundreds of billions of parameters) on as few as a single GPU, making massive model inference accessible to almost everyone. Moreover, by hosting model weights in significantly cheaper CPU or NVMe memory, it dramatically reduces GPU memory requirements and the cost of massive model inference, offering an affordable inference path to SOTA models.
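
    A minimal sketch, following the ZeRO-Inference recipe of a ZeRO stage 3 config with parameter offload; the model name is an illustrative stand-in, since the technique targets much larger checkpoints:

        import torch
        import deepspeed
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_name = "gpt2"  # stand-in for a multi-hundred-billion-parameter model
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)

        ds_config = {
            "fp16": {"enabled": True},
            "zero_optimization": {
                "stage": 3,  # ZeRO-Inference builds on ZeRO stage 3
                # Host weights in cheap CPU memory; "nvme" (with an "nvme_path"
                # entry) pushes them out to even cheaper storage.
                "offload_param": {"device": "cpu", "pin_memory": True},
            },
            "train_micro_batch_size_per_gpu": 1,  # required field, unused here
        }

        # Weights stream to the GPU layer by layer as the forward pass needs them.
        engine = deepspeed.initialize(model=model, config=ds_config)[0]
        engine.module.eval()

        inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(engine.device)
        with torch.no_grad():
            outputs = engine.module.generate(**inputs, max_new_tokens=32)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))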

  6. 2022
    Jul

    DeepSpeed helped train 176 Billion parameter BLOOM model

    The 176B BLOOM model has been trained using Megatron-DeepSpeed, a combination of two main technologies: DeepSpeed and Megatron-LM. DeepSpeed developed a 3D parallelism based implementation by combining ZeRO sharding and pipeline parallelism from the DeepSpeed library with tensor parallelism from Megatron-LM.
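
    An illustrative sketch (not the actual BLOOM launch configuration) of how 3D parallelism factors a cluster: each GPU is addressed by a (data, pipeline, tensor) coordinate, and the three degrees multiply to the world size. The degrees below are hypothetical example values:

        TENSOR = 4     # Megatron-LM tensor parallelism, usually within a node
        PIPELINE = 12  # DeepSpeed pipeline parallelism across nodes
        DATA = 8       # ZeRO-sharded data parallelism over model replicas

        WORLD_SIZE = TENSOR * PIPELINE * DATA  # 384 GPUs, e.g. 48 nodes x 8 GPUs

        # One common rank layout: tensor varies fastest, data slowest.
        def coords(rank):
            tensor = rank % TENSOR
            pipeline = (rank // TENSOR) % PIPELINE
            data = rank // (TENSOR * PIPELINE)
            return data, pipeline, tensor

        print(coords(0))    # (0, 0, 0)
        print(coords(383))  # (7, 11, 3)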

  7. DeepSpeed Compression: A composable library for extreme compression

    DeepSpeed releases a new pillar, DeepSpeed Compression, to tackle the latency and cost challenges of deploying large-scale deep learning models. It offers novel compression algorithms and supports synergistic composition of state-of-the-art compression methods, making inference faster and models smaller while dramatically reducing compression cost. With this release we demonstrated 32x smaller model size, 5.2x better efficiency, and 5000x lower compression cost.
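
    A minimal sketch of the compression workflow, assuming a DeepSpeed config file whose "compression_training" section selects techniques such as quantization, pruning, or layer reduction (the config path and model are illustrative):

        from transformers import AutoModelForSequenceClassification
        from deepspeed.compression.compress import init_compression, redundancy_clean

        model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

        # Wrap the selected layers with compression operators.
        model = init_compression(model, "ds_config.json")

        # ... fine-tune / distill the wrapped model as usual ...

        # Fold the compression operators away, yielding the final smaller model.
        model = redundancy_clean(model, "ds_config.json")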

  8. Azure and DeepSpeed empower easy-to-use and high-performance model training

    Azure ML, Azure HPC, and DeepSpeed collaborated to make large-scale distributed training easier and more efficient on Azure using DeepSpeed technology. We developed and released simple-to-use training pipelines for both Azure ML and Azure HPC. Azure and DeepSpeed combined offer excellent performance and scalability: we have scaled model sizes to 2 trillion parameters, scaled various workloads to 1024 A100-80GB GPUs, and obtained up to 1.8x higher throughput compared to the latest results published on other cloud providers.

  9. 2022
    Mar

    DeepSpeed support for efficient large model training on AMD GPUs

    We are excited to announce that DeepSpeed’s suite of training optimizations for efficient large model training is now available on ROCm-enabled AMD GPUs. This means that powerful parallelism and memory optimizations such as ZeRO, ZeRO-Offload, ZeRO-Infinity, and 3D parallelism can be used when training with AMD GPUs.

  10. 2021
    Oct

    DeepSpeed trained the world’s most powerful language model: Megatron-Turing NLG 530B

    We are excited to introduce the DeepSpeed- and Megatron-powered Megatron-Turing Natural Language Generation model (MT-NLG), the largest and the most powerful monolithic transformer language model trained to date, with 530 billion parameters. It is the result of a research collaboration between Microsoft and NVIDIA to further parallelize and optimize the training of very large AI models.