{"id":766675,"date":"2021-08-18T09:59:54","date_gmt":"2021-08-18T16:59:54","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=766675"},"modified":"2021-10-06T16:11:15","modified_gmt":"2021-10-06T23:11:15","slug":"deepspeed-powers-8x-larger-moe-model-training-with-high-performance","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-powers-8x-larger-moe-model-training-with-high-performance\/","title":{"rendered":"DeepSpeed powers 8x larger MoE model training with high performance"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-scaled.jpg\" alt=\"Graphs of DeepSpeed \"\/><\/figure>\n\n\n\n<p>Today, we are proud to announce DeepSpeed MoE, a high-performance system that supports massive scale mixture of experts (MoE) models as part of the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.deepspeed.ai\">DeepSpeed<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> optimization library. MoE models are an emerging class of sparsely activated models that have sublinear compute costs with respect to their parameters. For example, the Switch Transformer consists of 1.6 trillion parameters, while the compute required to train it is approximately equal to that of a 10 billion-parameter dense model. This increase in model size offers tremendous accuracy gains for a constant compute budget.<\/p>\n\n\n\n<p>However, supporting these MoE models with trillions of parameters requires a complex combination of multiple forms of parallelism that is simply not available in current MoE systems. 
DeepSpeed MoE overcomes these challenges through a symphony of multidimensional parallelism and heterogeneous memory technologies, such as <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-memory-optimizations-toward-training-trillion-parameter-models\/\">Zero Redundancy Optimizer (ZeRO)<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-offload-democratizing-billion-scale-model-training\/\">ZeRO-Offload<\/a>, harmoniously coming together to support massive MoE models\u2014even on limited GPU resources\u2014achieving efficiency, scalability, and ease of use. It enables 3.5 trillion-parameter models on 512 GPUs, 8x larger than existing work, while achieving 100 teraflops (TFLOPS) per GPU and attaining near-linear scalability with respect to the number of GPUs.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-memory-optimizations-toward-training-trillion-parameter-models\/\" data-bi-cN=\"ZeRO: Memory Optimizations Toward Training Trillion Parameter Models\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>ZeRO: Memory Optimizations Toward Training Trillion Parameter Models<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div 
class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-offload-democratizing-billion-scale-model-training\/\" data-bi-cN=\"ZeRO-Offload: Democratizing Billion-Scale Model Training\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>ZeRO-Offload: Democratizing Billion-Scale Model Training<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<p>Besides supporting the most ambitious scale MoE models, DeepSpeed MoE boosts the development productivity and resource efficiency of training modestly sized MoE models in production scenarios, which may be of broader interest to the deep learning (DL) community. As an example, we use it to train <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/a-holistic-representation-toward-integrative-ai\/\">Z-code<\/a> MoE, a production-quality, multilingual, and multitask language model with 10 billion parameters, achieving state-of-the-art results on machine translation and cross-lingual summarization tasks. &nbsp;<\/p>\n\n\n\n<h2 id=\"massive-moe-training-8x-larger-models-on-the-same-hardware\">Massive MoE training: 8x larger models on the same hardware<\/h2>\n\n\n\n<p>DeepSpeed MoE supports five different forms of parallelism, and it exploits both GPU and CPU memory. 
Its flexible design enables users to mix different types of prevalent parallelism techniques, as shown in Table 1.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"562\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeedTable1-1024x562.png\" alt=\"table: Flexible parallelism dimensions supported by DeepSpeed MoE\" class=\"wp-image-766705\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeedTable1-1024x562.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeedTable1-300x165.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeedTable1-768x422.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeedTable1-240x132.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeedTable1.png 1146w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Table 1: Flexible parallelism dimensions supported by DeepSpeed MoE<\/figcaption><\/figure><\/div>\n\n\n\n<p>Existing MoE systems support only expert, data, and model parallelism or a subset of them. 
This leads to three major limitations: (i) they replicate the base model (the part of the model without expert parameters) across data-parallel GPUs, resulting in wasted memory; (ii) they need model parallelism to scale the base model beyond 1.4 billion parameters, requiring nontrivial model code refactoring; and (iii) the total MoE model size is restricted by the total available GPU memory.<\/p>\n\n\n\n<p>By systematically combining expert, model, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-memory-optimizations-toward-training-trillion-parameter-models\/\">ZeRO parallelism<\/a>, DeepSpeed MoE overcomes the first two limitations, supporting base models with up to 15 billion parameters, larger than the base model of the Switch Transformer (Switch-C: 1.6 trillion parameters with a base model size of less than 5 billion parameters). This is over 10x larger compared with existing MoE systems, which can support only 1.4 billion parameters without adding the complexity of model parallelism. When combined with model parallelism, the base model alone can have over 100 billion parameters, which is simply not possible with existing systems.<\/p>\n\n\n\n<p>In addition, with support for ZeRO-Offload, DeepSpeed MoE transcends the GPU memory wall, supporting MoE models with 3.5 trillion parameters on 512 NVIDIA A100 GPUs by leveraging both GPU and CPU memory. This is an 8x increase in the total model size (3.5 trillion vs. 400 billion) compared with existing MoE systems that are limited by the total GPU memory. 
Alternatively, DeepSpeed MoE achieves the same model scale with 8x fewer resources (400 billion parameters on 64 GPUs instead of 512), as shown in Figure 1.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"988\" height=\"599\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Figure1_updated.png\" alt=\"Figure: Compared with existing work, DeepSpeed MoE powers 8x bigger models using the same number of GPUs, or equivalently, requires 8x fewer GPUs to support the same model size.\" class=\"wp-image-767146\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Figure1_updated.png 988w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Figure1_updated-300x182.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Figure1_updated-768x466.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Figure1_updated-240x146.png 240w\" sizes=\"auto, (max-width: 988px) 100vw, 988px\" \/><figcaption>Figure 1: Compared with existing work, DeepSpeed MoE powers 8x bigger models using the same number of GPUs, or equivalently, requires 8x fewer GPUs to support the same model size.<\/figcaption><\/figure><\/div>\n\n\n\n<h2 id=\"increased-throughput-near-linear-scalability-and-ease-of-use\">Increased throughput, near-linear scalability, and ease of use<\/h2>\n\n\n\n<p>Given the sparse nature of MoE models, it is challenging to keep GPU utilization as high as that of dense models. To address this, we have performed a thorough cross-analysis of system performance and model configurations for DeepSpeed MoE. For model configurations that are optimized for system performance, we achieve over 100 TFLOPS per GPU on NVIDIA A100 GPUs, which is on par with dense model training scenarios. 
Additionally, DeepSpeed MoE exhibits near-linear scalability for throughput as the number of GPUs increases, as shown in Figure 2.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"592\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6-figure2_updated-1024x592.png\" alt=\"Figure 2: DeepSpeed MoE scales near-linearly with respect to the number of GPUs.\" class=\"wp-image-767161\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6-figure2_updated-1024x592.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6-figure2_updated-300x173.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6-figure2_updated-768x444.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6-figure2_updated-240x139.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6-figure2_updated.png 1259w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 2: DeepSpeed MoE scales near-linearly with respect to the number of GPUs.<\/figcaption><\/figure><\/div>\n\n\n\n<p>While DeepSpeed MoE is a highly scalable and high-performance system, we have carefully designed its user-facing API, shown in Figure 3, to be flexible and simple to use. It enables users to enhance their models with MoE layers without complex code changes to their training pipeline. It supports various MoE-specific parameters, including number of experts, type of expert, and different gating functions (top-1, top-2, noisy, and 32-bit). 
In addition, we have devised a new technique called \u201cRandom Token Selection,\u201d described in more detail in our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.deepspeed.ai\/tutorials\/mixture-of-experts\">tutorial<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which greatly improves convergence, is part of the DeepSpeed library, and is enabled by default so users can take advantage of it without any code changes.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"147\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig3-1024x147.png\" alt=\"Figure 3: DeepSpeed MoE layer API\" class=\"wp-image-766714\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig3-1024x147.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig3-300x43.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig3-768x110.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig3-240x34.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig3.png 1195w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 3: DeepSpeed MoE layer API<\/figcaption><\/figure><\/div>\n\n\n\n<h2 id=\"powering-state-of-the-art-performance-in-production-scenarios\">Powering state-of-the-art performance in production scenarios<\/h2>\n\n\n\n<p>Z-code, a part of Microsoft&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/msturing.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Project Turing<span class=\"sr-only\"> (opens in new 
tab)<\/span><\/a>,&nbsp;consists&nbsp;of a family of multilingual pretrained models that can be used for various downstream language tasks.&nbsp;The <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/\">DeepSpeed<\/a> library has been used to scale and accelerate the training of many Z-code models, resulting in state-of-the-art performance on various tasks. Powered by DeepSpeed MoE, we can now train MoE models that are much larger than dense models.<\/p>\n\n\n\n<p>We trained a transformer model with a 24-layer encoder, a 12-layer decoder, an embedding dimension of 1024, and a feedforward dimension of 4096. By including experts in every other encoder\/decoder layer, similar to the Switch Transformer, we scaled a 700 million-parameter dense model up to 1.8 billion parameters (using eight experts) and 10 billion parameters (using 64 experts). As shown in Figure 4, all MoE configurations converge faster and achieve better loss with far fewer training steps. The 10 billion-parameter model is 14 times larger than its dense equivalent and was trained with 10 times fewer training steps and five times less wall-clock time, reaching a loss value of 4.3 on 64 NVIDIA A100 GPUs. 
This highlights the efficiency and the scalability of the MoE model.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig4.png\" alt=\"Figure 4: Better model convergence with more experts.\"\/><figcaption>Figure 4: Better model convergence with more <em>experts<\/em>.<\/figcaption><\/figure><\/div>\n\n\n\n<p>The models were trained as generic text-to-text transformation models, similar to <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1910.10683\">T5<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. We trained them on machine translation and denoising auto-encoding tasks using multilingual parallel and monolingual data with 336 billion tokens in 50 languages. We used Z-code MoE (10B), shown in Figure 5, to demonstrate its efficacy on various downstream tasks. 
We are excited to report that the Z-code MoE model outperformed both the dense model and the highly optimized individual bilingual models.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"822\" height=\"613\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_figure5_updated.png\" alt=\"Figure 5: Z-code MoE (10B) outperforms other systems on BLEU scores for an in-house 50 language test dataset.\" class=\"wp-image-767164\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_figure5_updated.png 822w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_figure5_updated-300x224.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_figure5_updated-768x573.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_figure5_updated-80x60.png 80w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_figure5_updated-240x180.png 240w\" sizes=\"auto, (max-width: 822px) 100vw, 822px\" \/><figcaption>Figure 5: Z-code MoE (10B) outperforms other systems on BLEU scores for an in-house 50 language test dataset.<\/figcaption><\/figure><\/div>\n\n\n\n<p><strong>MoE with multitask learning. <\/strong>We found multitask learning to be very effective at leveraging multiple learning tasks to improve downstream performance, especially in multilingual and multimodal setups. We also found that dense models are harder to train in multitask setups due to the difficulty of balancing model capacity between tasks. 
By leveraging DeepSpeed MoE, the Z-code MoE model achieved improved convergence and higher quality in a multitask training setting compared with the non-MoE model, as shown in Figure 6.<\/p>\n\n\n\n<p>The dense model performed better with a single machine translation task, while the Z-code MoE model performed better with a multitask learning setup. This enabled the Z-code MoE model to achieve state-of-the-art performance on multilingual machine translation upstream tasks, outperforming the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2010.11125\">M2M<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> 12 billion-parameter model, as shown in Figure 7.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"876\" height=\"684\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig6.png\" alt=\"Figure 6: Better multitask performance with MoE architecture\" class=\"wp-image-766726\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig6.png 876w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig6-300x234.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig6-768x600.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed_6_Fig6-231x180.png 231w\" sizes=\"auto, (max-width: 876px) 100vw, 876px\" \/><figcaption>Figure 6: Better multitask performance with MoE architecture<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"775\" height=\"625\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Fig7_updated.png\" alt=\"Figure 7: BLEU scores on a WMT dataset on 18 language pairs translating from and to English\" class=\"wp-image-767167\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Fig7_updated.png 775w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Fig7_updated-300x242.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Fig7_updated-768x619.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Fig7_updated-223x180.png 223w\" sizes=\"auto, (max-width: 775px) 100vw, 775px\" \/><figcaption>Figure 7: BLEU scores on a WMT dataset on 18 language pairs translating from and to English<\/figcaption><\/figure><\/div>\n\n\n\n<p>Z-code MoE as a multitask multilingual model can be utilized for various downstream language tasks. The model achieved a state-of-the-art result on the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/2010.03093.pdf\">Wikilingua<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> dataset, a multilingual abstractive summarization task, by scoring 50 percent more than the best published ROUGE score, as illustrated in Figure 8. We did this by fine-tuning the pretrained Z-code MoE model with the dataset.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/DeepSpeed6_Fig8.png\" alt=\"Figure 8: Wikilingua task ROUGE scores. 
The 10 billion-parameter Z-code MoE model trained with DeepSpeed MoE (orange) is compared with mBART across four different language translation tracks: Spanish-English (ES-EN), Turkish-English (TR-EN), Russian-English (RU-EN), and Vietnamese-English (VI-EN).\"\/><figcaption>Figure 8: Wikilingua task ROUGE scores. The 10 billion-parameter Z-code MoE model trained with DeepSpeed MoE (orange) is compared with mBART across four different language translation tracks: Spanish-English (ES-EN), Turkish-English (TR-EN), Russian-English (RU-EN), and Vietnamese-English (VI-EN).<\/figcaption><\/figure><\/div>\n\n\n\n<p><strong>Growing Z-code MoE models even bigger and better.<\/strong> With DeepSpeed MoE, Z-code language models are being scaled even further. We are currently working to train a 200 billion-parameter version of the model. Ongoing scaling results continue to demonstrate the efficiency of the DeepSpeed MoE implementation and the effectiveness of the Z-code MoE model. We have observed over 20,000 updates performed in less than two days with 256 NVIDIA A100 GPUs. In contrast, a dense model of similar size would take approximately 24 days to run through the same number of updates on the same hardware. Furthermore, compared with the 10 billion-parameter version, Z-code MoE 200B uses similar training time and achieves higher accuracy. &nbsp;<\/p>\n\n\n\n<p>As demonstrated above, MoE models with fewer resources, excellent throughput, and near-linear scalability can achieve great results. We\u2019re confident that our memory-efficient, high-performance, and easy-to-use MoE implementation will help accelerate the development of your production models and help power the ambition of training multibillion- and multitrillion-parameter MoE models. 
Please visit the\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/www.deepspeed.ai\" target=\"_blank\" rel=\"noopener noreferrer\">DeepSpeed website<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/scalable-and-efficient-moe-training-for-multitask-multilingual-models\/\">paper<\/a>,\u00a0and the\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/DeepSpeed\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repository<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u00a0for code, tutorials, and documentation on DeepSpeed MoE.<\/p>\n\n\n\n<h3 id=\"about-the-z-code-team\">About the Z-code Team:<\/h3>\n\n\n\n<p>The Z-code team comprises a group of researchers and engineers\u2014Young&nbsp;Jin&nbsp;Kim, Alex Muzio, Felipe Cruz Salinas,&nbsp;Liyang&nbsp;Lu, Amr Hendy, Hany Hassan Awadalla (team lead)\u2014who are part of Azure AI and&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/msturing.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Project Turing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,&nbsp;focusing on building multilingual, large-scale language models that support various production teams.<\/p>\n\n\n\n<h3 id=\"about-the-deepspeed-team\">About the DeepSpeed Team:<\/h3>\n\n\n\n<p>We are a group of system researchers and engineers\u2014Samyam Rajbhandari, Ammar Ahmad Awan, Conglong Li, Minjia Zhang, Jeff Rasley, Reza Yazdani Aminabadi, Elton Zheng, Cheng Li, Olatunji Ruwase, Shaden Smith, Arash Ashari, Niranjan Uma Naresh, Jeffrey Zhu, Yuxiong He (team lead)\u2014who are enthusiastic about performance optimization of large-scale systems. 
We have recently focused on deep learning systems, optimizing deep learning\u2019s speed to train, speed to convergence, and speed to develop.<\/p>\n\n\n\n<p>If this type of work interests you, the&nbsp;DeepSpeed&nbsp;team is hiring both researchers and engineers! Please visit&nbsp;our&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/careers.microsoft.com\/us\/en\/search-results?keywords=deepspeed%20open%20source\" target=\"_blank\" rel=\"noopener noreferrer\">careers page<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today, we are proud to announce DeepSpeed MoE, a high-performance system that supports massive scale mixture of experts (MoE) models as part of the DeepSpeed (opens in new tab) optimization library. MoE models are an emerging class of sparsely activated models that have sublinear compute costs with respect to their parameters. For example, the Switch 
[&hellip;]<\/p>\n","protected":false},"author":40519,"featured_media":767644,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-766675","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[678390,649749,765364],"related-events":[],"related-researchers":[{"type":"guest","value":"deepspeed-team","user_id":"690909","display_name":"DeepSpeed Team","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/#!people\" aria-label=\"Visit the profile page for DeepSpeed Team\">DeepSpeed Team<\/a>","is_active":true,"last_first":"Team, DeepSpeed","people_section":0,"alias":"deepspeed-team"},{"type":"guest","value":"z-code-team","user_id":"766690","display_name":"Z-code Team  ","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/project-zcode\/#!people\" aria-label=\"Visit the profile page for Z-code Team  \">Z-code Team  <\/a>","is_active":true,"last_first":"Z-code Team ","people_section":0,"alias":"z-code-team"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-960x540.jpg\" class=\"img-object-cover\" alt=\"DeepSpeed\u00a0MoE\u00a0powers eight times bigger models using expert-parallelism +\u00a0ZeRO-Offload compared with expert-parallelism only. A graph shows supported model sizes on NVIDIA A100 GPUs.\u00a0DeepSpeed\u00a0MoE\u00a0scales near-linearly with respect to the number of GPUs.\u00a0Z-code\u00a0MoE\u00a0(10B) consistently outperforms other systems on BLEU scores for an in-house 50 language test dataset. Read more in the blog post.\u00a0\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-1536x864.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-2048x1152.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-343x193.jpg 343w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_Deepspeed_MOE_blog_no_logo_v3-1-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/#!people\" title=\"Go to researcher profile for DeepSpeed Team\" aria-label=\"Go to researcher profile for DeepSpeed Team\" data-bi-type=\"byline author\" data-bi-cN=\"DeepSpeed Team\">DeepSpeed Team<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/project-zcode\/#!people\" title=\"Go to researcher profile for Z-code Team  \" aria-label=\"Go to researcher profile for Z-code Team  \" data-bi-type=\"byline author\" data-bi-cN=\"Z-code Team  \">Z-code Team  <\/a>","formattedDate":"August 18, 2021","formattedExcerpt":"Today, we are proud to announce DeepSpeed MoE, a high-performance system that supports massive scale mixture of experts (MoE) models as part of the DeepSpeed (opens in new tab) optimization library. 
MoE models are an emerging class of sparsely activated models that have sublinear compute&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/766675","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/40519"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=766675"}],"version-history":[{"count":23,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/766675\/revisions"}],"predecessor-version":[{"id":782704,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/766675\/revisions\/782704"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/767644"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=766675"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=766675"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=766675"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=766675"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=766675"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=766675"},{"taxonomy":"msr-locale","emb
eddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=766675"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=766675"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=766675"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=766675"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=766675"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}