{"id":678390,"date":"2020-07-23T17:00:06","date_gmt":"2020-07-24T00:00:06","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=678390"},"modified":"2024-09-09T08:34:50","modified_gmt":"2024-09-09T15:34:50","slug":"deepspeed","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/","title":{"rendered":"DeepSpeed"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background- card-background--full-bleed\">\n\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"4400\" height=\"1650\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/DeepSpeed_AI_header4_10-2022_1920x720.png\" class=\"attachment-full size-full\" alt=\"DeepSpeed\" style=\"object-position: 80% 20%\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/DeepSpeed_AI_header4_10-2022_1920x720.png 4400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/DeepSpeed_AI_header4_10-2022_1920x720-300x113.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/DeepSpeed_AI_header4_10-2022_1920x720-1024x384.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/DeepSpeed_AI_header4_10-2022_1920x720-768x288.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/DeepSpeed_AI_header4_10-2022_1920x720-1536x576.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/DeepSpeed_AI_header4_10-2022_1920x720-2048x768.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/DeepSpeed_AI_header4_10-2022_1920x720-1920x720.png 1920w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/DeepSpeed_AI_header4_10-2022_1920x720-1600x600.png 1600w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/DeepSpeed_AI_header4_10-2022_1920x720-240x90.png 240w\" sizes=\"auto, (max-width: 4400px) 100vw, 4400px\" \/>\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading h2\" id=\"deepspeed\">DeepSpeed<\/h1>\n\n\n\n<p>Extreme Speed and Scale for DL Training and Inference<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"287\" height=\"107\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/10\/DeepSpeed_light.png\" alt=\"deepspeed logo\" class=\"wp-image-888996\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/10\/DeepSpeed_light.png 287w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/10\/DeepSpeed_light-240x89.png 240w\" sizes=\"auto, (max-width: 287px) 100vw, 287px\" \/><\/figure>\n\n\n\n<h5 class=\"wp-block-heading is-style-default\" 
id=\"deepspeed-is-an-easy-to-use-deep-learning-optimization-software-suite-that-enables-unprecedented-scale-and-speed-for-dl-training-and-inference-visit-us-at-deepspeed-ai-or-our-github-repo\">DeepSpeed is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for DL Training and Inference. Visit us at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.deepspeed.ai\" target=\"_blank\" rel=\"noopener noreferrer\">deepspeed.ai<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> or our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/DeepSpeed\" target=\"_blank\" rel=\"noopener noreferrer\">Github repo<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/h5>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<table><tbody><tr><td width=\"100\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/10\/DS-T-logo.png\" alt=\"DS-Training logo\" style=\"width: 75px\"><\/td><td valign=\"bottom\"><h3>Reshape Large Model Training Landscape<\/h3><\/td><\/tr><tr><td><\/td><td>DeepSpeed offers a confluence of system innovations, that has made large scale DL training effective, and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of scale that is possible. These innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, ZeRO-Infinity, etc. fall under the DeepSpeed-Training pillar. <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.deepspeed.ai\/training\/\" target=\"_blank\" rel=\"noopener noreferrer\">Learn more >><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><\/tbody><\/table>\n\n\n\n<table><tbody><tr><td width=\"100\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/10\/DS-I-logo.png\" alt=\"DS-Training logo\" style=\"width: 75px\"><\/td><td valign=\"bottom\"><h3>Optimize Large Model Inference<\/h3><\/td><\/tr><tr><td><\/td><td>DeepSpeed brings together innovations in parallelism technology such as tensor, pipeline, expert and ZeRO-parallelism, and combines them with high performance custom inference kernels, communication optimizations and heterogeneous memory technologies to enable inference at an unprecedented scale, while achieving unparalleled latency, throughput and cost reduction. This systematic composition of system technologies for inference falls under the DeepSpeed-Inference. 
<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.deepspeed.ai\/inference\/\" target=\"_blank\" rel=\"noopener noreferrer\">Learn more >><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><\/tbody><\/table>\n\n\n\n<table><tbody><tr><td width=\"100\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/10\/DS-C-logo.png\" alt=\"DS-Training logo\" style=\"width: 75px\"><\/td><td valign=\"bottom\"><h3>Speed Up Inference & Reduce Model Size<\/h3><\/td><\/tr><tr><td><\/td><td>To further increase the inference efficiency, DeepSpeed offers easy-to-use and flexible-to-compose compression techniques for researchers and practitioners to compress their models while delivering faster speed, smaller model size, and significantly reduced compression cost. Moreover, SoTA innovations on compression like ZeroQuant and XTC are included under the DeepSpeed-Compression pillar. <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.deepspeed.ai\/compression\/\" target=\"_blank\" rel=\"noopener noreferrer\">Learn more >><span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\n<\/td><\/tr><\/tbody><\/table>\n\n\n\n<div style=\"height:50px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\n\n<h2 class=\"wp-block-heading\" id=\"deepspeed-model-implementations-for-inference-mii\">DeepSpeed Model Implementations for Inference (MII)<\/h2>\n\n\n\n<h2 class=\"wp-block-heading is-style-m\" id=\"instant-speedup-on-24-000-open-source-dl-models-with-up-to-40x-cheaper-inference\">Instant speedup on 24,000+ open-source DL models with up to 40x cheaper inference.<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large is-resized is-style-default\"><img decoding=\"async\" src=\"https:\/\/www.deepspeed.ai\/assets\/images\/mii\/hero.png\" alt=\"DeepSpeed-MII Supported Models\" style=\"width:800px\" \/><\/figure>\n\n\n\n<p>The Deep Learning (DL) open-source community has seen tremendous growth in the last few months. Incredibly powerful text generation models such as the Bloom 176B, or image generation models such as Stable Diffusion are now available to anyone with access to a handful or even a single GPU through platforms such as Hugging Face. While open-sourcing has democratized access to AI capabilities, their application is still restricted by two critical factors: 1) inference latency and 2) cost.<\/p>\n\n\n\n<p>There has been significant progress in system optimizations for DL model inference that can drastically reduce both latency and cost, but those are not easily accessible. The main reason for this limited accessibility is that the DL model inference landscape is diverse with models varying in size, architecture, system performance characteristics, hardware requirements, etc. 
Identifying the appropriate set of system optimizations applicable to a given model and applying them correctly is often beyond the scope of most data scientists, making low-latency and low-cost inference mostly inaccessible.

[DeepSpeed-MII](https://github.com/microsoft/DeepSpeed-MII) is a new open-source Python library from DeepSpeed, aimed at making low-latency, low-cost inference of powerful models not only feasible but also easily accessible.

- MII offers access to highly optimized implementations of **thousands of widely used DL models**.
- MII-supported models achieve significantly lower latency and cost compared to their original implementations.
  - MII reduces the **latency of the Big Science Bloom 176B model by 5.7x**, while reducing the **cost by over 40x**, as shown in Figures 2 (left) and 8.
  - MII reduces the **latency and cost of deploying Stable Diffusion by 1.9x**, as shown in Figure 2 (right).
- To enable low-latency/cost inference, MII leverages an extensive set of optimizations from DeepSpeed-Inference such as *DeepFusion* for transformers, automated tensor-slicing for multi-GPU inference, on-the-fly quantization with *ZeroQuant*, and several others (see below for more details).
- With state-of-the-art performance, MII supports low-cost deployment of these models both on-premises and on Azure via AML with just **a few lines of code**.

### How does MII work?

![DeepSpeed-MII Architecture](https://www.deepspeed.ai/assets/images/mii/mii-arch.png)

*Figure 1: MII architecture, showing how MII automatically optimizes OSS models using DS-Inference before deploying them on-premises using GRPC, or on Microsoft Azure via AML Inference.*

Under the hood, MII is powered by [DeepSpeed-Inference](https://arxiv.org/abs/2207.00032). Based on the model type, model size, batch size, and available hardware resources, MII automatically applies the appropriate set of system optimizations from DeepSpeed-Inference to minimize latency and maximize throughput. It does so by using one of many pre-specified model injection policies that allow MII and DeepSpeed-Inference to identify the underlying PyTorch model architecture and replace it with an optimized implementation (see *Figure 1*). In doing so, MII makes the expansive set of optimizations in DeepSpeed-Inference automatically available for the thousands of popular models that it supports.
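For intuition, here is roughly what such an injection policy looks like at the DeepSpeed-Inference level, following the T5 example in DeepSpeed's inference tutorial. MII selects such policies automatically, so this is an illustrative sketch rather than MII's internal code:

```python
# Sketch of the injection-policy mechanism MII builds on, modeled on the
# T5 example from DeepSpeed's inference tutorial. MII normally picks such
# a policy for you based on the detected model architecture.
import torch
import deepspeed
from transformers import T5ForConditionalGeneration
from transformers.models.t5.modeling_t5 import T5Block

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The policy names the submodule outputs that mark the boundaries of each
# transformer block, so DeepSpeed can swap in its optimized implementation.
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.half,
    injection_policy={T5Block: ('SelfAttention.o', 'EncDecAttention.o',
                                'DenseReluDense.wo')},
)
model = engine.module  # drop-in replacement for the original model
```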
### Supported Models and Tasks

MII supports a growing list of tasks such as text generation, question answering, text classification, etc., across thousands of transformer models available through multiple open-source model repositories such as Hugging Face, FairSeq, EleutherAI, etc. It supports dense models based on the BERT, RoBERTa, GPT, OPT, and BLOOM architectures, ranging from a few hundred million parameters to hundreds of billions of parameters. At the same time, it supports recent image generation models such as Stable Diffusion.

See the MII GitHub repo for an up-to-date list of [models and tasks supported by MII](https://github.com/microsoft/deepspeed-mii#supported-models-and-tasks).

### Inference Optimizations with MII

Here we provide a summary of the expansive set of optimizations from DeepSpeed-Inference made available via MII. For more details, please refer to [[1](https://arxiv.org/abs/2207.00032), [2](https://arxiv.org/abs/2206.01861)]:

**DeepFusion for Transformers:** For transformer-based models such as BERT, RoBERTa, GPT-2, and GPT-J, MII leverages the transformer kernels in DeepSpeed-Inference that are optimized to achieve low latency at small batch sizes and high throughput at large batch sizes using DeepFusion.

**Multi-GPU Inference with Tensor-Slicing:** For massive models such as Bloom 176B, MII automatically enables tensor parallelism within a node to leverage the aggregate memory bandwidth and compute across multiple GPUs, achieving lower latency and higher throughput than anything else currently available.

**INT8 Inference with ZeroQuant:** For massive models with tens or hundreds of billions of parameters, MII supports INT8 inference with ZeroQuant. Using this feature not only reduces the memory footprint and the number of GPUs required for inference but also increases the inference throughput by supporting larger batch sizes and using INT8 compute, thus lowering cost compared to FP16.
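Tensor-slicing and ZeroQuant are typically enabled together through DeepSpeed-Inference's entry point. Below is a hedged sketch modeled on the public BLOOM inference scripts; the model name and parallelism degree are placeholders, not MII's actual defaults:

```python
# Sketch: tensor-slicing across 8 GPUs plus INT8 (ZeroQuant-style) compute.
# Launch with the distributed launcher, e.g.: deepspeed --num_gpus 8 infer.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")  # placeholder

engine = deepspeed.init_inference(
    model,
    mp_size=8,                        # shard each layer across 8 GPUs
    dtype=torch.int8,                 # INT8 weights and compute
    replace_with_kernel_inject=True,  # swap in DeepSpeed's fused kernels
)
model = engine.module                 # use generate()/forward as usual
```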
**ZeRO-Inference for Resource-Constrained Systems:** Models such as Bloom 176B require over 176 GB of memory just to fit the model, even with INT8 support. When the aggregate GPU memory across multiple GPUs required to deploy such models is not available, MII enables [ZeRO-Inference](https://www.deepspeed.ai/2022/09/09/zero-inference.html), which can leverage system CPU memory to deploy these massive models on a single GPU with limited memory.
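For a flavor of how this looks in practice, here is a minimal sketch in the spirit of the public ZeRO-Inference examples, assuming the `deepspeed.initialize` API with a ZeRO stage-3 parameter-offload config; the small stand-in model and exact config keys are illustrative:

```python
# Sketch of a ZeRO-Inference-style setup: ZeRO stage-3 partitions the
# parameters and offloads them to CPU memory, streaming them to one GPU
# on demand. A small model stands in here for a truly massive one.
# Launch with, e.g.: deepspeed --num_gpus 1 zero_inference.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"  # stand-in; the technique targets 100B+ models
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

ds_config = {
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,  # required field, unused at inference
    "zero_optimization": {
        "stage": 3,                       # partition parameters
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()
inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
with torch.no_grad():
    print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=20)[0]))
```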
**Compiler Optimizations:** When applicable, MII automatically applies compiler-based optimizations via [TorchScript](https://pytorch.org/docs/stable/jit.html), [nvFuser](https://pytorch.org/blog/introducing-nvfuser-a-deep-learning-compiler-for-pytorch/), and [CUDA graphs](https://developer.nvidia.com/blog/cuda-graphs/), in addition to the above optimizations, to further lower latency and improve throughput.

### Quantifying Latency and Cost Reduction

Inference workloads can be either latency-critical, where the primary objective is to minimize latency, or cost-sensitive, where the primary objective is to minimize cost. In this section, we quantify the benefits of using MII for both scenarios.

MII can work with two variations of DeepSpeed-Inference. The first, referred to as ds-public, contains most of the optimizations discussed above and is also available via our open-source DeepSpeed library. The second, referred to as ds-azure, offers tighter integration with Azure and is available via MII to all Microsoft Azure customers. We refer to MII running the two DeepSpeed-Inference variants as MII-Public and MII-Azure, respectively.

Both MII-Public and MII-Azure offer significant latency and cost reduction compared to the open-source PyTorch implementation (Baseline); however, for certain generative workloads, they can have differentiated performance. Here, we quantify the latency and cost reduction for both variations.

#### Latency-Critical Scenarios

For latency-critical scenarios, where a small batch size of 1 is often used, MII can reduce the latency by up to 6x for a wide range of open-source models across multiple tasks. More specifically, we show model latency reduction of¹:

1. Up to 5.7x for multi-GPU inference for text generation using massive models such as Big Science Bloom, Facebook OPT, and EleutherAI NeoX (*Figure 2 (left)*)
2. Up to 1.9x for image generation tasks using Stable Diffusion (*Figure 2 (right)*)
3. Up to 3x for relatively smaller text generation models (up to 7B parameters) based on the OPT, BLOOM, and GPT architectures, running on a single GPU (*Figures 3 and 4*)
4. Up to 9x for various text representation tasks like fill-mask, text classification, question answering, and token classification using RoBERTa- and BERT-based models (*Figures 5 and 6*)

![SD Latency Comparison](https://www.deepspeed.ai/assets/images/mii/llm-latency-sd-latency.png)

*Figure 2: (left) Best achievable latency for large models. MII-Azure (INT8) offers 5.7x lower latency compared to Baseline for Bloom-176B. (right) Stable Diffusion text-to-image generation latency comparison.*

![BLOOM Optimization](https://www.deepspeed.ai/assets/images/mii/opt-bloom.png)

*Figure 3: Latency comparison for OPT and BLOOM models. MII-Azure is up to 2.8x faster than baseline.*

![GPT data](https://www.deepspeed.ai/assets/images/mii/gpt.png)

*Figure 4: Latency comparison for GPT models. MII-Azure is up to 3x faster than baseline.*

![RoBERTa models](https://www.deepspeed.ai/assets/images/mii/roberta.png)

*Figure 5: Latency comparison for RoBERTa models. MII offers up to 9x lower model latency and up to 3x lower end-to-end latency than baseline on several tasks and RoBERTa variants¹.*

![BERT models](https://www.deepspeed.ai/assets/images/mii/bert.png)

*Figure 6: Latency comparison for BERT models. MII offers up to 8.9x lower model latency and up to 4.5x lower end-to-end latency across several tasks and BERT variants¹.*

#### Cost-Sensitive Scenarios

MII can significantly reduce the inference cost of very expensive language models like Bloom, OPT, etc. To get the lowest cost, we use a large batch size that maximizes throughput for both baseline and MII. Here we look at the cost reduction from MII using two different metrics: i) tokens generated per second per GPU, and ii) dollars per million tokens generated.
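To make the relationship between the two metrics concrete, here is the back-of-the-envelope conversion between them; the GPU rate used below is a made-up placeholder, not a quoted Azure price:

```python
# Convert per-GPU throughput into dollars per million generated tokens.
def dollars_per_million_tokens(tokens_per_sec_per_gpu: float,
                               gpu_cost_per_hour: float) -> float:
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# A 10x throughput improvement translates into a 10x lower cost per token.
print(dollars_per_million_tokens(10.0, 3.0))   # baseline-like throughput: ~$83.3
print(dollars_per_million_tokens(100.0, 3.0))  # 10x faster:               ~$8.3
```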
*Figures 7 and 8* show that MII-Public offers over 10x throughput improvement and cost reduction compared to the baseline, respectively. Furthermore, MII-Azure offers over 30x improvement in throughput and cost compared to the baseline.

![Model Throughput per GPU](https://www.deepspeed.ai/assets/images/mii/tput-llms.png)

*Figure 7: Throughput comparison per A100-80GB GPU for large models. MII-Public offers over 15x throughput improvement, while MII-Azure offers over 40x throughput improvement.*

![Azure cost](https://www.deepspeed.ai/assets/images/mii/azure-cost.png)

*Figure 8: Cost of generating 1 million tokens on Azure with different model types. MII-Azure reduces the cost of generation by over 40x.*

### Deployment Options

MII-supported models can be deployed in two different ways, as shown in *Figure 1*, with just a few lines of code.

#### MII-Public Deployment

MII-Public can be deployed on-premises or on any cloud offering. MII creates a lightweight GRPC server to support this form of deployment and provides a GRPC inference endpoint for queries. The code below shows how a supported model can be deployed with MII-Public:

```python
import mii

mii.deploy(task="text-to-image",
           model="CompVis/stable-diffusion-v1-4",
           deployment_name="sd-deployment")
```
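Once deployed, the GRPC endpoint can be queried from Python. The sketch below follows the pattern in MII's getting-started examples; the handle function and the image attribute on the response are assumptions that may vary across MII versions:

```python
import mii

# Connect to the GRPC deployment created above by its deployment_name.
generator = mii.mii_query_handle("sd-deployment")

# Send a prompt; for text-to-image the response is assumed to carry
# PIL-style images, following MII's example scripts.
result = generator.query({"query": "a photo of an astronaut riding a horse"})
result.images[0].save("astronaut.png")
```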
#### MII-Azure Deployment

MII supports deployment on Azure via AML Inference. To enable this, MII generates AML deployment assets for a given model that can be deployed using the [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/what-is-azure-cli), as shown in the code below. Furthermore, deploying on Azure allows MII to leverage DeepSpeed-Azure as its optimization backend, which offers better latency and cost reduction than DeepSpeed-Public.

```python
import mii
from mii.constants import DeploymentType  # import path may vary across MII versions

mii.deploy(task="text-to-image",
           model="CompVis/stable-diffusion-v1-4",
           deployment_name="sd-deployment",
           deployment_type=DeploymentType.AML)
```

To learn more about these deployment options and get started with MII, please see the [MII getting started guide](https://github.com/microsoft/deepspeed-mii#getting-started-with-mii).

### Concluding Remarks

We are very excited to share MII with the community and to improve it with your feedback. We will continue to add support for more models in MII and to enhance both MII-Public and MII-Azure for on-premises and Azure users alike. Our hope is that, while open-sourcing has made powerful AI capabilities accessible to many, MII will allow a wider infusion of these capabilities into a diverse set of applications and product offerings by instantly reducing the latency and cost of inference.

### Appendix

The table below shows the mapping between the model aliases used in *Figures 3, 4, 5, and 6* and the real model names. Each model name refers to the corresponding Hugging Face model page (huggingface.co/<model name>).

| Alias | Model Name |
| --- | --- |
| text-gen-m1 | sberbank-ai/rugpt3large_based_on_gpt2 |
| text-gen-m2 | skt/kogpt2-base-v2 |
| text-gen-m3 | geralt/MechDistilGPT2 |
| text-gen-m4 | mrm8488/distilgpt2-finetuned-wsb-tweets |
| text-gen-m5 | Norod78/hebrew-bad_wiki-gpt_neo-tiny |
glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/shibing624\/code-autocomplete-distilgpt2-python\" target=\"_blank\" rel=\"noopener noreferrer\">shibing624\/code-autocomplete-distilgpt2-python<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-gen-m7<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/mrm8488\/diltilgpt2-finetuned-bookcopus-10\" target=\"_blank\" rel=\"noopener noreferrer\">mrm8488\/diltilgpt2-finetuned-bookcopus-10<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-q&a-m1<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/bert-large-uncased-whole-word-masking-finetuned-squad\" target=\"_blank\" rel=\"noopener noreferrer\">bert-large-uncased-whole-word-masking-finetuned-squad<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-q&a-m2<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/deepset\/bert-large-uncased-whole-word-masking-squad2\" target=\"_blank\" rel=\"noopener noreferrer\">deepset\/bert-large-uncased-whole-word-masking-squad2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-q&a-m3<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/nyust-eb210\/braslab-bert-drcd-384\" target=\"_blank\" rel=\"noopener noreferrer\">nyust-eb210\/braslab-bert-drcd-384<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-q&a-m4<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/deepset\/minilm-uncased-squad2\" target=\"_blank\" rel=\"noopener noreferrer\">deepset\/minilm-uncased-squad2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-token-class-m1<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/dslim\/bert-large-NER\" target=\"_blank\" rel=\"noopener noreferrer\">dslim\/bert-large-NER<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-token-class-m2<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/dbmdz\/bert-large-cased-finetuned-conll03-english\" target=\"_blank\" rel=\"noopener noreferrer\">dbmdz\/bert-large-cased-finetuned-conll03-english<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-token-class-m3<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/dslim\/bert-base-NER\" target=\"_blank\" rel=\"noopener noreferrer\">dslim\/bert-base-NER<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-token-class-m4<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/CAMeL-Lab\/bert-base-arabic-camelbert-mix-ner\" target=\"_blank\" rel=\"noopener noreferrer\">CAMeL-Lab\/bert-base-arabic-camelbert-mix-ner<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m1<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab 
glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/bert-base-multilingual-cased\" target=\"_blank\" rel=\"noopener noreferrer\">bert-base-multilingual-cased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m2<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/bert-base-multilingual-uncased\" target=\"_blank\" rel=\"noopener noreferrer\">bert-base-multilingual-uncased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m3<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/wietsedv\/bert-base-dutch-cased\" target=\"_blank\" rel=\"noopener noreferrer\">wietsedv\/bert-base-dutch-cased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m4<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/nlpaueb\/bert-base-greek-uncased-v1\" target=\"_blank\" rel=\"noopener noreferrer\">nlpaueb\/bert-base-greek-uncased-v1<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m5<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/dbmdz\/bert-base-italian-xxl-cased\" target=\"_blank\" rel=\"noopener noreferrer\">dbmdz\/bert-base-italian-xxl-cased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m6<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/aubmindlab\/bert-base-arabertv02\" target=\"_blank\" rel=\"noopener noreferrer\">aubmindlab\/bert-base-arabertv02<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m7<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/dccuchile\/bert-base-spanish-wwm-uncased\" target=\"_blank\" rel=\"noopener noreferrer\">dccuchile\/bert-base-spanish-wwm-uncased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m8<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/bert-base-german-cased\" target=\"_blank\" rel=\"noopener noreferrer\">bert-base-german-cased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m9<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/bert-base-uncased\" target=\"_blank\" rel=\"noopener noreferrer\">bert-base-uncased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m10<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/dbmdz\/bert-base-german-cased\" target=\"_blank\" rel=\"noopener noreferrer\">dbmdz\/bert-base-german-cased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m11<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/nlpaueb\/legal-bert-base-uncased\" target=\"_blank\" rel=\"noopener noreferrer\">nlpaueb\/legal-bert-base-uncased<span 
class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m12<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/KB\/bert-base-swedish-cased\" target=\"_blank\" rel=\"noopener noreferrer\">KB\/bert-base-swedish-cased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m13<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/indolem\/indobertweet-base-uncased\" target=\"_blank\" rel=\"noopener noreferrer\">indolem\/indobertweet-base-uncased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m14<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/emilyalsentzer\/Bio_ClinicalBERT\" target=\"_blank\" rel=\"noopener noreferrer\">emilyalsentzer\/Bio_ClinicalBERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-fill-mask-m15<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/asafaya\/bert-mini-arabic\" target=\"_blank\" rel=\"noopener noreferrer\">asafaya\/bert-mini-arabic<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-text-class-m1<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/DTAI-KULeuven\/mbert-corona-tweets-belgium-topics\" target=\"_blank\" rel=\"noopener noreferrer\">DTAI-KULeuven\/mbert-corona-tweets-belgium-topics<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-text-class-m2<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/avichr\/heBERT_sentiment_analysis\" target=\"_blank\" rel=\"noopener noreferrer\">avichr\/heBERT_sentiment_analysis<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-text-class-m3<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/finiteautomata\/beto-sentiment-analysis\" target=\"_blank\" rel=\"noopener noreferrer\">finiteautomata\/beto-sentiment-analysis<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-text-class-m4<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/ProsusAI\/finbert\" target=\"_blank\" rel=\"noopener noreferrer\">ProsusAI\/finbert<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-text-class-m5<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/cross-encoder\/ms-marco-MiniLM-L-12-v2\" target=\"_blank\" rel=\"noopener noreferrer\">cross-encoder\/ms-marco-MiniLM-L-12-v2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-text-class-m6<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/nlptown\/bert-base-multilingual-uncased-sentiment\" target=\"_blank\" rel=\"noopener noreferrer\">nlptown\/bert-base-multilingual-uncased-sentiment<span class=\"sr-only\"> (opens in new 
tab)<\/span><\/a><\/td><\/tr><tr><td>bert-text-class-m7<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/microsoft\/xtremedistil-l6-h256-uncased\" target=\"_blank\" rel=\"noopener noreferrer\">microsoft\/xtremedistil-l6-h256-uncased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>bert-text-class-m8<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/cross-encoder\/ms-marco-MiniLM-L-6-v2\" target=\"_blank\" rel=\"noopener noreferrer\">cross-encoder\/ms-marco-MiniLM-L-6-v2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m1<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/vinai\/bertweet-large\" target=\"_blank\" rel=\"noopener noreferrer\">vinai\/bertweet-large<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m2<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/klue\/roberta-large\" target=\"_blank\" rel=\"noopener noreferrer\">klue\/roberta-large<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m3<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/sberbank-ai\/ruRoberta-large\" target=\"_blank\" rel=\"noopener noreferrer\">sberbank-ai\/ruRoberta-large<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>q&a-m1<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/deepset\/roberta-large-squad2\" target=\"_blank\" rel=\"noopener noreferrer\">deepset\/roberta-large-squad2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>token-class-m1<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/Jean-Baptiste\/roberta-large-ner-english\" target=\"_blank\" rel=\"noopener noreferrer\">Jean-Baptiste\/roberta-large-ner-english<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m1<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/cross-encoder\/stsb-roberta-large\" target=\"_blank\" rel=\"noopener noreferrer\">cross-encoder\/stsb-roberta-large<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m2<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/siebert\/sentiment-roberta-large-english\" target=\"_blank\" rel=\"noopener noreferrer\">siebert\/sentiment-roberta-large-english<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m3<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/roberta-large-mnli\" target=\"_blank\" rel=\"noopener noreferrer\">roberta-large-mnli<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m4<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" 
href=\"https:\/\/huggingface.co\/vinai\/bertweet-base\" target=\"_blank\" rel=\"noopener noreferrer\">vinai\/bertweet-base<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m5<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/vinai\/phobert-base\" target=\"_blank\" rel=\"noopener noreferrer\">vinai\/phobert-base<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m6<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/microsoft\/graphcodebert-base\" target=\"_blank\" rel=\"noopener noreferrer\">microsoft\/graphcodebert-base<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m7<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/vinai\/bertweet-covid19-base-uncased\" target=\"_blank\" rel=\"noopener noreferrer\">vinai\/bertweet-covid19-base-uncased<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m8<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/uklfr\/gottbert-base\" target=\"_blank\" rel=\"noopener noreferrer\">uklfr\/gottbert-base<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m9<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/cardiffnlp\/twitter-roberta-base\" target=\"_blank\" rel=\"noopener noreferrer\">cardiffnlp\/twitter-roberta-base<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m10<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/microsoft\/codebert-base-mlm\" target=\"_blank\" rel=\"noopener noreferrer\">microsoft\/codebert-base-mlm<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m11<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/pdelobelle\/robbert-v2-dutch-base\" target=\"_blank\" rel=\"noopener noreferrer\">pdelobelle\/robbert-v2-dutch-base<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m12<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/ufal\/robeczech-base\" target=\"_blank\" rel=\"noopener noreferrer\">ufal\/robeczech-base<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>q&a-m2<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/Rakib\/roberta-base-on-cuad\" target=\"_blank\" rel=\"noopener noreferrer\">Rakib\/roberta-base-on-cuad<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>q&a-m3<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/thatdramebaazguy\/roberta-base-squad\" target=\"_blank\" rel=\"noopener noreferrer\">thatdramebaazguy\/roberta-base-squad<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m4<\/td><td><a class=\"msr-external-link glyph-append 
glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/roberta-base-openai-detector\" target=\"_blank\" rel=\"noopener noreferrer\">roberta-base-openai-detector<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m5<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/pysentimiento\/robertuito-emotion-analysis\" target=\"_blank\" rel=\"noopener noreferrer\">pysentimiento\/robertuito-emotion-analysis<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m6<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/cardiffnlp\/twitter-roberta-base-sentiment\" target=\"_blank\" rel=\"noopener noreferrer\">cardiffnlp\/twitter-roberta-base-sentiment<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m7<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/cardiffnlp\/twitter-roberta-base-sentiment-latest\" target=\"_blank\" rel=\"noopener noreferrer\">cardiffnlp\/twitter-roberta-base-sentiment-latest<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>q&a-m4<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/deepset\/roberta-base-squad2\" target=\"_blank\" rel=\"noopener noreferrer\">deepset\/roberta-base-squad2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m8<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/textattack\/roberta-base-SST-2\" target=\"_blank\" rel=\"noopener noreferrer\">textattack\/roberta-base-SST-2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m9<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/cardiffnlp\/twitter-roberta-base-emotion\" target=\"_blank\" rel=\"noopener noreferrer\">cardiffnlp\/twitter-roberta-base-emotion<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m10<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/pysentimiento\/robertuito-sentiment-analysis\" target=\"_blank\" rel=\"noopener noreferrer\">pysentimiento\/robertuito-sentiment-analysis<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m11<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/finiteautomata\/bertweet-base-sentiment-analysis\" target=\"_blank\" rel=\"noopener noreferrer\">finiteautomata\/bertweet-base-sentiment-analysis<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>fill-mask-m13<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/huggingface\/CodeBERTa-small-v1\" target=\"_blank\" rel=\"noopener noreferrer\">huggingface\/CodeBERTa-small-v1<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>q&a-m5<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" 
href=\"https:\/\/huggingface.co\/deepset\/tinyroberta-squad2\" target=\"_blank\" rel=\"noopener noreferrer\">deepset\/tinyroberta-squad2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><tr><td>text-class-m12<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/j-hartmann\/emotion-english-distilroberta-base\" target=\"_blank\" rel=\"noopener noreferrer\">j-hartmann\/emotion-english-distilroberta-base<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<br><p>&nbsp;<\/p>\n\n\n\n<p class=\"has-gray-color has-text-color\" id=\"overhead_details\">1. The end-to-end latency of an inference workload is comprised of two components: i) actual model execution, and ii) pre-\/post-processing before and after the model execution. MII optimizes the actual model execution but leaves the pre-\/post-processing pipeline for future optimizations. We notice that text representation tasks have significant pre-\/post-processing overhead (<em>Figures G and H<\/em>). We plan to address those in a future update.<\/p>\n\n\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\t<div class=\"wp-block-msr-block-journey journey journey--date alignwide\" data-bi-aN=\"block-journey\">\n\t\t<ol class=\"journey__list\">\n\t\t\t\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2023\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tFeb\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title is-style-default\" id=\"deepspeed-supports-automatic-tensor-parallelism-for-huggingface-models\">DeepSpeed supports automatic tensor parallelism for HuggingFace models<\/h3>\n\n\n\n<p>Previously, a user needed to provide an injection policy to DeepSpeed to enable tensor parallelism. DeepSpeed now supports automatic tensor parallelism for HuggingFace models by default as long as kernel injection is not enabled and an injection policy is not provided. This allows our users to improve performance of models that are not currently supported via kernel injection, without providing the injection policy. 
See our tutorial on the new automatic tensor parallelism feature for inference.

Tutorial: [Automatic Tensor Parallelism for HuggingFace Models](https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/)

### December 2022: DeepSpeed Data Efficiency Library: Towards Less Data, Faster Training, and Higher Model Quality

DeepSpeed releases a new Data Efficiency Library to reduce training data and cost while boosting model quality, via new innovations on data sampling and data routing with composable and customizable library support. The library greatly reduces training cost while maintaining model quality (1.5-2x less data and time for GPT-3/BERT pretraining), or further improves model quality under the same training cost (>1 point gain in GPT-3 1.3B zero-/few-shot evaluation).

Blog: [DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality](https://www.deepspeed.ai/2022/12/11/data-efficiency.html)
has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2022\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tNov\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title is-style-default\" id=\"achieve-sub-second-stable-diffusion-image-generation-with-deepspeed-mii\">Achieve sub-second Stable Diffusion Image Generation with DeepSpeed-MII<\/h3>\n\n\n\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/CompVis\/stable-diffusion-v1-4\" target=\"_blank\" rel=\"noopener noreferrer\">Stable Diffusion<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is a latent text-to-image diffusion model, which is capable of creating stunning art within seconds. In this tutorial you will learn how to deploy and run Stable Diffusion with state-of-the-art performance optimizations from <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/deepspeed\" target=\"_blank\" rel=\"noopener noreferrer\">DeepSpeed-Inference<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/deepspeed-mii\" target=\"_blank\" rel=\"noopener noreferrer\">DeepSpeed-MII<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and achieve image generation under one second.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">TUTORIAL<\/span>\n\t\t\t<a href=\"https:\/\/github.com\/microsoft\/DeepSpeed-MII\/tree\/main\/examples\/benchmark\/txt2img\" data-bi-cN=\"Stable Diffusion Image Generation under 1 second with DeepSpeed-MII\" target=\"_blank\" rel=\"noopener noreferrer\" data-external-link=\"true\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Stable Diffusion Image Generation under 1 second with DeepSpeed-MII<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-open-in-new-tab\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2022\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tOct\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title is-style-default\" id=\"deepspeed-model-implementations-for-inference-mii\">DeepSpeed Model Implementations for Inference 
<p><strong>October 2022</strong></p>

<h3 id="deepspeed-model-implementations-for-inference-mii">DeepSpeed Model Implementations for Inference (MII)</h3>

<p>DeepSpeed-MII, a new open-source Python library from DeepSpeed, is now available to make low-latency, low-cost inference of powerful deep learning models not only feasible but also easily accessible to everyone. MII <em>/ɛm-aɪ-tu/</em> offers highly optimized implementations of thousands of widely used deep learning models; for example, MII can speed up Stable Diffusion by 1.9x and the BigScience BLOOM 176B model by 5.7x, with up to 40x cost reduction.</p>

<p>Blog: <a href="https://www.deepspeed.ai/2022/10/10/mii.html">DeepSpeed-MII: instant speedup on 24,000+ open-source DL models with up to 40x cheaper inference</a></p>
<p><strong>September 2022</strong></p>

<h3 id="zero-inference-democratizing-massive-model-inference">ZeRO-Inference: Democratizing massive model inference</h3>

<p><a href="https://www.deepspeed.ai/2022/09/09/zero-inference.html">ZeRO-Inference</a> comes from the family of ZeRO technologies, a collection of powerful memory and parallelism optimizations for efficient large-scale model training and inference on modern GPU clusters. ZeRO-Inference enables inference computation of massive models (with hundreds of billions of parameters) on as few as one GPU, making massive model inference accessible to almost everyone. Moreover, by substituting significantly cheaper CPU or NVMe memory for scarce GPU memory, it dramatically reduces the cost of massive model inference, offering an affordable path to SOTA models.</p>
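<p>To make the offloading idea concrete, here is a minimal sketch in the spirit of ZeRO-Inference: a ZeRO stage 3 configuration that hosts parameters in CPU (or NVMe) memory and streams them to the GPU on demand. The config keys follow DeepSpeed's documented ZeRO options; the tiny model stands in for a real multi-billion-parameter checkpoint, and passing a config dict directly assumes a recent DeepSpeed release (older versions used the <code class="language-plaintext highlighter-rouge">config_params</code> argument). This is not the exact recipe from the blog.</p>

<pre><code class="language-python"># Sketch of ZeRO-Inference-style offloading: stage-3 parameter partitioning
# with CPU offload, so the GPU only holds the layers it is currently computing.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required field, unused at inference
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},  # or "nvme" plus an nvme_path
    },
}

# Toy stand-in for a massive model loaded from a checkpoint.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()
with torch.no_grad():
    out = engine(torch.randn(1, 1024, device=engine.device, dtype=torch.half))
</code></pre>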
<p><strong>July 2022</strong></p>

<h3 id="deepspeed-helped-train-176-billion-parameter-bloom-model">DeepSpeed helped train the 176-billion-parameter BLOOM model</h3>

<p>The 176B BLOOM model was trained using Megatron-DeepSpeed, a combination of two main technologies: DeepSpeed and Megatron-LM. DeepSpeed developed a 3D-parallelism-based implementation by combining ZeRO sharding and pipeline parallelism from the DeepSpeed library with tensor parallelism from Megatron-LM.</p>

<p>Blog: <a href="https://huggingface.co/blog/bloom-megatron-deepspeed">The Technology Behind BLOOM Training</a></p>

<h3 id="deepspeed-compression-a-composable-library-for-extreme-compression">DeepSpeed Compression: A composable library for extreme compression</h3>

<p>DeepSpeed releases a new pillar, <a href="https://www.deepspeed.ai/compression/">DeepSpeed Compression</a>, to tackle the latency and cost challenges of deploying large-scale deep learning models. It offers novel compression algorithms and supports synergistic composition of state-of-the-art compression methods, making inference faster and models smaller while dramatically reducing the cost of compression itself. With this release we demonstrated 32x smaller model size, 5.2x better efficiency, and 5000x lower compression cost.</p>

<p>Blog: <a href="https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/">DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization</a></p>
<h3 id="azure-and-deepspeed-empower-easy-to-use-and-high-performance-model-training">Azure and DeepSpeed empower easy-to-use and high-performance model training</h3>

<p>Azure ML, Azure HPC, and DeepSpeed collaborated to make large-scale distributed training easier and more efficient on Azure using DeepSpeed technology. We developed and released simple-to-use training pipelines for both Azure ML and Azure HPC. Combined, Azure and DeepSpeed offer excellent performance and scalability: we have scaled model sizes to 2 trillion parameters, scaled various workloads to 1024 A100-80GB GPUs, and obtained up to 1.8x higher throughput than the latest results published on other cloud providers.</p>

<p>Blog: <a href="https://azure.microsoft.com/en-us/blog/azure-empowers-easytouse-highperformance-and-hyperscale-model-training-using-deepspeed/">Azure empowers easy-to-use, high-performance, and hyperscale model training using DeepSpeed</a></p>
<p><strong>March 2022</strong></p>

<h3 id="deepspeed-support-for-efficient-large-model-training-on-amd-gpus">DeepSpeed support for efficient large model training on AMD GPUs</h3>

<p>We are excited to announce that DeepSpeed's suite of training optimizations for efficient large model training is now available on ROCm-enabled AMD GPUs. This means that powerful parallelism and memory optimizations such as ZeRO, ZeRO-Offload, ZeRO-Infinity, and 3D parallelism can be used while training with AMD GPUs.</p>

<p>Blog: <a href="https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/">Supporting efficient large model training on AMD Instinct™ GPUs with DeepSpeed</a></p>
<p><strong>October 2021</strong></p>

<h3 id="deepspeed-trained-the-world-s-most-powerful-language-model-megatron-turing-nlg-530b">DeepSpeed trained the world's most powerful language model: Megatron-Turing NLG 530B</h3>

<p>We are excited to introduce the DeepSpeed- and Megatron-powered Megatron-Turing Natural Language Generation model (MT-NLG), the largest and most powerful monolithic transformer language model trained to date, with 530 billion parameters. It is the result of a research collaboration between Microsoft and NVIDIA to further parallelize and optimize the training of very large AI models.</p>

<p>Blog: <a href="https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/">Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model</a></p>

<h2 id="why-deepspeed">Why DeepSpeed?</h2>

<p>Training advanced deep learning models is challenging. Beyond model design, model scientists also need to set up state-of-the-art training techniques such as distributed training, mixed precision, gradient accumulation, and checkpointing. Even then, they may not achieve the desired system performance and convergence rate. Large model sizes are even more challenging: a large model easily runs out of memory with pure data parallelism, and it is difficult to use model parallelism.</p>
<p>DeepSpeed addresses these challenges to accelerate model development <em>and</em> training.</p>

<h3 id="distributed-effective-and-efficient-training-with-ease">Distributed, Effective, and Efficient Training with Ease</h3>

<p>The DeepSpeed API is a lightweight wrapper on <a href="https://pytorch.org/">PyTorch</a>. This means that you can use everything you love in PyTorch without learning a new platform. In addition, DeepSpeed manages all of the boilerplate state-of-the-art training techniques, such as distributed training, mixed precision, gradient accumulation, and checkpointing, so that you can focus on your model development. Most importantly, you can leverage the distinctive efficiency and effectiveness benefits of DeepSpeed to boost speed and scale with just a few lines of code changes to your PyTorch models.</p>
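<p>To make the "few lines of code changes" concrete, here is a minimal sketch following the pattern in the DeepSpeed documentation. The toy model, data, and config values are illustrative placeholders, and the script is meant to be run under the <code class="language-plaintext highlighter-rouge">deepspeed</code> launcher (e.g., <code class="language-plaintext highlighter-rouge">deepspeed train.py</code>).</p>

<pre><code class="language-python"># Minimal DeepSpeed training loop: deepspeed.initialize wraps the model, and
# engine.backward / engine.step replace loss.backward / optimizer.step.
import torch
import deepspeed

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": True},
}

net = torch.nn.Linear(128, 10)  # stands in for a real PyTorch model
engine, optimizer, _, _ = deepspeed.initialize(
    model=net, model_parameters=net.parameters(), config=ds_config
)

for _ in range(10):  # stands in for iterating a real DataLoader
    x = torch.randn(8, 128, device=engine.device, dtype=torch.half)
    y = torch.randint(0, 10, (8,), device=engine.device)
    loss = torch.nn.functional.cross_entropy(engine(x), y)
    engine.backward(loss)   # replaces loss.backward()
    engine.step()           # replaces optimizer.step() and gradient zeroing
</code></pre>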
<h3 id="speed">Speed</h3>

<p>DeepSpeed achieves high performance and fast convergence through a combination of efficiency optimizations on compute/communication/memory/IO and effectiveness optimizations on advanced hyperparameter tuning and optimizers. For example:</p>

<ul>
<li>DeepSpeed trains BERT-large to parity in 44 minutes using 1024 V100 GPUs (64 DGX-2 boxes) and in 2.4 hours using 256 GPUs (16 DGX-2 boxes).</li>
<li>DeepSpeed trains GPT-2 (1.5 billion parameters) 3.75x faster than the state-of-the-art NVIDIA Megatron on Azure GPUs.</li>
</ul>

<p><a href="https://www.deepspeed.ai/tutorials/megatron/">Read the GPT tutorial &gt;</a></p>

<h3 id="memory-efficiency">Memory efficiency</h3>

<p>DeepSpeed provides memory-efficient data parallelism and enables training models without model parallelism. For example, DeepSpeed can train models with up to 13 billion parameters on NVIDIA V100 GPUs with 32GB of device memory. In comparison, existing frameworks (e.g., PyTorch's Distributed Data Parallel) run out of memory with 1.4-billion-parameter models.</p>

<p>DeepSpeed reduces the training memory footprint through a novel solution called the Zero Redundancy Optimizer (ZeRO). Unlike basic data parallelism, where memory states are replicated across data-parallel processes, ZeRO partitions model states and gradients to save significant memory. Furthermore, it also reduces activation memory and fragmented memory. The current implementation (ZeRO-2) reduces memory by up to 8x relative to the state of the art. You can read more about ZeRO in our <a href="https://www.microsoft.com/en-us/research/publication/zero-memory-optimizations-toward-training-trillion-parameter-models/">paper</a> and in our blog post on <a href="https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/">ZeRO-1</a>.</p>

<p>With this impressive memory reduction, early adopters of DeepSpeed have already produced a language model (LM) with over 17B parameters called <a href="https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft">Turing-NLG</a>, establishing a new SOTA in the LM category.</p>

<h3 id="scalability">Scalability</h3>

<p>DeepSpeed supports efficient data parallelism, model parallelism, and their combination, and ZeRO boosts the scaling capability and efficiency further.</p>

<ul>
<li>DeepSpeed provides system support to run models of up to 170 billion parameters, 10x larger than the state of the art (8-billion-parameter NVIDIA GPT, 11-billion-parameter Google T5).</li>
<li>DeepSpeed can run large models more efficiently, up to 10x faster, for models with sizes spanning 1.5B to 170B parameters. More specifically, the data parallelism powered by ZeRO is complementary to and can be combined with different types of model parallelism. This allows DeepSpeed to fit models using a lower degree of model parallelism and a higher batch size, offering significant performance gains compared to using model parallelism alone. Read more in the <a href="https://www.microsoft.com/en-us/research/publication/zero-memory-optimizations-toward-training-trillion-parameter-models/">ZeRO paper</a> and the <a href="https://www.deepspeed.ai/tutorials/megatron">GPT tutorial</a>.</li>
</ul>

<figure><img src="https://www.deepspeed.ai/assets/images/deepspeed-speedup.png" alt="DeepSpeed Speedup" /><figcaption>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with the model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone.</figcaption></figure>

<h3 id="fast-convergence-for-effectiveness">Fast convergence for effectiveness</h3>

<p>DeepSpeed supports advanced hyperparameter tuning and large-batch-size optimizers such as <a href="https://arxiv.org/abs/1904.00962">LAMB</a>. These improve the effectiveness of model training and reduce the number of samples required to converge to the desired accuracy.</p>

<p><a href="https://www.deepspeed.ai/tutorials/1Cycle">Read the Tuning tutorial &gt;</a></p>

<h3 id="good-usability">Good Usability</h3>

<p>Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not place limitations on model dimensions (such as the number of attention heads or hidden sizes), batch size, or any other training parameters. For models of up to 13 billion parameters, you can conveniently use ZeRO-powered data parallelism without requiring model parallelism, whereas standard data parallelism will run out of memory for models with more than 1.4 billion parameters. In addition, DeepSpeed conveniently supports flexible combination of ZeRO-powered data parallelism with custom model parallelism, such as the tensor slicing of NVIDIA's Megatron-LM.</p>
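<p>For that combination, <code class="language-plaintext highlighter-rouge">deepspeed.initialize</code> accepts a model-parallel unit (<code class="language-plaintext highlighter-rouge">mpu</code>) object describing the process groups. The stub below only sketches the shape of such an object, using the method names DeepSpeed documents for the mpu interface; a real integration would pass Megatron-LM's own mpu module rather than this placeholder.</p>

<pre><code class="language-python"># Sketch: passing a model-parallel unit (mpu) so ZeRO's data parallelism
# composes with an external tensor-parallel implementation. The method names
# follow DeepSpeed's documented mpu interface; bodies are placeholders.
import deepspeed

class StubMPU:
    """Placeholder mpu exposing the process-group accessors DeepSpeed expects."""
    def get_model_parallel_rank(self): ...
    def get_model_parallel_group(self): ...
    def get_model_parallel_world_size(self): ...
    def get_data_parallel_rank(self): ...
    def get_data_parallel_group(self): ...
    def get_data_parallel_world_size(self): ...

# engine, _, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(),
#     mpu=StubMPU(), config="ds_config.json")  # placeholder config file name
</code></pre>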
<h2>DeepSpeed features</h2>

<p>Below we provide a brief feature list; see our detailed <a href="https://www.deepspeed.ai/features/">feature overview</a> for descriptions and usage. A sketch of a configuration that enables several of these features follows the list.</p>

<ul>
<li><a href="https://www.deepspeed.ai/features/#distributed-training-with-mixed-precision">Distributed training with mixed precision</a>
<ul>
<li>16-bit mixed precision</li>
<li>Single-GPU/Multi-GPU/Multi-Node</li>
</ul>
</li>
<li><a href="https://www.deepspeed.ai/features/#model-parallelism">Model parallelism</a>
<ul>
<li>Support for Custom Model Parallelism</li>
<li>Integration with Megatron-LM</li>
</ul>
</li>
<li><a href="https://www.deepspeed.ai/features/#the-zero-redundancy-optimizer">The Zero Redundancy Optimizer (ZeRO)</a>
<ul>
<li>Optimizer State and Gradient Partitioning</li>
<li>Activation Partitioning</li>
<li>Constant Buffer Optimization</li>
<li>Contiguous Memory Optimization</li>
</ul>
</li>
<li><a href="https://www.deepspeed.ai/features/#additional-memory-and-bandwidth-optimizations">Additional memory and bandwidth optimizations</a>
<ul>
<li>Smart Gradient Accumulation</li>
<li>Communication/Computation Overlap</li>
</ul>
</li>
<li><a href="https://www.deepspeed.ai/features/#training-features">Training features</a>
<ul>
<li>Simplified training API</li>
<li>Activation Checkpointing API</li>
<li>Gradient Clipping</li>
<li>Automatic loss scaling with mixed precision</li>
</ul>
</li>
<li><a href="https://www.deepspeed.ai/features/#training-optimizers">Training optimizers</a>
<ul>
<li>Fused Adam optimizer and arbitrary <code class="language-plaintext highlighter-rouge">torch.optim.Optimizer</code></li>
<li>Memory-bandwidth-optimized FP16 Optimizer</li>
<li>Large Batch Training with LAMB Optimizer</li>
<li>Memory-efficient Training with ZeRO Optimizer</li>
</ul>
</li>
<li><a href="https://www.deepspeed.ai/features/#training-agnostic-checkpointing">Training-agnostic checkpointing</a></li>
<li><a href="https://www.deepspeed.ai/features/#advanced-parameter-search">Advanced parameter search</a>
<ul>
<li>Learning Rate Range Test</li>
<li>1Cycle Learning Rate Schedule</li>
</ul>
</li>
<li><a href="https://www.deepspeed.ai/features/#simplified-data-loader">Simplified data loader</a></li>
<li><a href="https://www.deepspeed.ai/features/#performance-analysis-and-debugging">Performance analysis and debugging</a></li>
</ul>
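<p>Several of the features above (mixed precision with automatic loss scaling, gradient accumulation, gradient clipping, ZeRO, and communication/computation overlap) are enabled declaratively through the DeepSpeed config. The keys below follow DeepSpeed's documented config schema; the specific values are illustrative placeholders, not recommendations.</p>

<pre><code class="language-python"># Illustrative DeepSpeed config enabling several features from the list above.
ds_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 4,   # smart gradient accumulation
    "gradient_clipping": 1.0,           # gradient clipping
    "fp16": {
        "enabled": True,                # 16-bit mixed precision
        "loss_scale": 0,                # 0 selects automatic (dynamic) loss scaling
    },
    "zero_optimization": {
        "stage": 2,                     # optimizer state + gradient partitioning
        "overlap_comm": True,           # communication/computation overlap
        "contiguous_gradients": True,   # contiguous memory optimization
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}
</code></pre>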
<h2>Contributing to DeepSpeed</h2>

<p>DeepSpeed welcomes your contributions! Please see our <a href="https://www.deepspeed.ai/contributing/">contributing</a> guide for more details on formatting, testing, etc.</p>

<h3 id="contributor-license-agreement">Contributor License Agreement</h3>

<p>This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit <a href="https://cla.opensource.microsoft.com">https://cla.opensource.microsoft.com</a>.</p>

<p>When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.</p>

<h3 id="code-of-conduct">Code of Conduct</h3>

<p>This project has adopted the <a href="https://opensource.microsoft.com/codeofconduct/">Microsoft Open Source Code of Conduct</a>. For more information, see the <a href="https://opensource.microsoft.com/codeofconduct/faq/">Code of Conduct FAQ</a> or contact <a href="mailto:opencode@microsoft.com">opencode@microsoft.com</a> with any additional questions or comments.</p>