{"id":798607,"date":"2021-11-22T09:40:53","date_gmt":"2021-11-22T17:40:53","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=798607"},"modified":"2021-11-22T09:40:54","modified_gmt":"2021-11-22T17:40:54","slug":"tutel-an-efficient-mixture-of-experts-implementation-for-large-dnn-model-training","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/tutel-an-efficient-mixture-of-experts-implementation-for-large-dnn-model-training\/","title":{"rendered":"Tutel: An efficient mixture-of-experts implementation for large DNN model training"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-scaled.jpg\" alt=\"A line graph comparing the end-to-end performance of Meta\u2019s MoE language model using Azure NDm A100 v4 nodes with and without Tutel. The x-axis is the number of A100 (80GB) GPUs, beginning at 8 and going up to 512, and the y-axis is the throughput (K tokens\/s), beginning with 0 and going up to 1,000 in intervals of 100. Tutel always achieves higher throughput than fairseq. \"\/><\/figure>\n\n\n\n<p>Mixture of experts (MoE) is a deep learning model architecture in which computational cost is sublinear to the number of parameters, making scaling easier. 
Nowadays, MoE is the only approach demonstrated to scale deep learning models to trillion-plus parameters, paving the way for models capable of learning even more information and powering computer vision, speech recognition, natural language processing, and machine translation systems, among others, that can help people and organizations in new ways.<\/p>\n\n\n\n<p>Today, we&#8217;re proud to announce <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/tutel\">Tutel, a high-performance MoE library<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> to facilitate the development of large-scale DNN models; Tutel is highly optimized for <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/microsoft-expands-its-aisupercomputer-lineup-with-general-availability-of-the-latest-80gb-nvidia-a100-gpus-in-azure-claims\/\">the new Azure NDm A100 v4 series, now generally available<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. With Tutel\u2019s diverse and flexible MoE algorithmic support, developers across AI domains can execute MoE more easily and efficiently. 
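To make the routing idea behind MoE concrete, here is a minimal, framework-free sketch of top-k gating, the scheme these MoE libraries implement (illustrative pure Python only; the function name is hypothetical, and real gating runs as optimized GPU kernels over batched tensors):

```python
import math

def top_k_gate(logits, k=2):
    """Minimal top-k gating sketch: softmax over per-expert scores,
    then route the token to its k highest-scoring experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest gate probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]

# One token's scores over 4 experts: only the top-2 experts run on this
# token, so compute cost scales with k rather than with the expert count.
routes = top_k_gate([0.1, 2.0, -1.0, 1.5], k=2)
```

Because each token activates only k experts out of many, adding experts grows the parameter count without a proportional growth in computation, which is the sublinear scaling property described above.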
For a single MoE layer, Tutel achieves an 8.49x speedup on an NDm A100 v4 node with 8 GPUs and a 2.75x speedup on 64 NDm A100 v4 nodes with 512 A100 GPUs (all experiments in this blog are tested on Azure NDm A100 v4 nodes with 8 x 80 GB NVIDIA A100 and an 8 x 200 gigabits per second InfiniBand network), compared with state-of-the-art MoE implementations such as that in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/pytorch\/fairseq\/tree\/moe\">Meta\u2019s Facebook AI Research Sequence-to-Sequence Toolkit (fairseq)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> in PyTorch. For end-to-end performance, Tutel\u2014benefiting from an optimization for all-to-all communication\u2014achieves a more than 40 percent speedup with 64 NDm A100 v4 nodes for Meta\u2019s (Facebook is now Meta) 1.1 trillion\u2013parameter MoE language model. Tutel provides broad compatibility and rich features to ensure strong performance on Azure NDm A100 v4 clusters. Tutel is open source and has been integrated into fairseq.<\/p>\n\n\n\n<h2 id=\"tutel-moe-optimizations\">Tutel MoE optimizations<\/h2>\n\n\n\n<p>Complementary to other high-level MoE solutions like fairseq and FastMoE, Tutel focuses mainly on optimizing MoE-specific computation and all-to-all communication, as well as providing diverse and flexible algorithmic MoE support. Tutel has a concise interface, making it easy to integrate into other MoE solutions. 
Alternatively, developers can use the Tutel interface to incorporate standalone MoE layers into their own DNN models from scratch and benefit from the highly optimized state-of-the-art MoE features directly.<\/p>\n\n\n\n<h3 id=\"moe-specific-optimization-for-computation\">MoE-specific optimization for computation<\/h3>\n\n\n\n<p>Because of the lack of efficient implementations, MoE-based DNN models rely on a naive combination of multiple off-the-shelf DNN operators provided by deep learning frameworks such as PyTorch and TensorFlow to compose the MoE computation. Such a practice incurs significant performance overheads due to redundant computation. Tutel designs and implements multiple highly optimized GPU kernels to provide operators for MoE-specific calculation. For example, Tutel reduces the time complexity of dispatching \u201cgating output\u201d from O(N^3) to O(N^2), which significantly improves the data dispatching efficiency. Tutel also implements a fast cumsum-minus-one operator, achieving a 24x speedup compared with the fairseq implementation. In addition, Tutel leverages <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/docs.nvidia.com\/cuda\/nvrtc\/index.html\">NVRTC<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, a runtime compilation library for CUDA C++, to further optimize the customized MoE kernel just-in-time. 
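The role of the cumsum-minus-one operator can be seen in a framework-free sketch: it is an exclusive prefix sum over the routing mask that assigns each token a slot inside its chosen expert's buffer (function names here are hypothetical, and Tutel fuses this logic into a single GPU kernel rather than looping in Python):

```python
def exclusive_cumsum(xs):
    """'cumsum minus one': for each position, the count of earlier
    occurrences -- the basis for assigning per-expert buffer slots."""
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out

def dispatch_locations(expert_ids, num_experts):
    """For each token, the slot index within the buffer of the expert it
    was routed to (tokens routed earlier get earlier slots)."""
    locations = [0] * len(expert_ids)
    for e in range(num_experts):
        mask = [1 if eid == e else 0 for eid in expert_ids]
        locs = exclusive_cumsum(mask)
        for t, m in enumerate(mask):
            if m:
                locations[t] = locs[t]
    return locations

# Five tokens routed to experts [0, 1, 0, 1, 1]: the two expert-0 tokens
# occupy slots 0 and 1 of expert 0's buffer, the three expert-1 tokens
# occupy slots 0, 1, and 2 of expert 1's buffer.
slots = dispatch_locations([0, 1, 0, 1, 1], num_experts=2)
```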
Figure 1 shows the comparison results of Tutel with fairseq on the Azure NDm A100 v4 platform, where\u2014as mentioned above\u2014a single MoE layer with Tutel achieves an 8.49x speedup on 8 A100 GPUs and a 2.75x speedup on 512 A100 GPUs.<\/p>\n\n\n\n<h3 id=\"underlying-all-to-all-communication-optimization-on-azure-ndm-a100-v4-clusters\">Underlying all-to-all communication optimization on Azure NDm A100 v4 clusters<\/h3>\n\n\n\n<p>Tutel also optimizes the all-to-all collective communication for large-scale MoE training on Azure NDm A100 v4 clusters, including CPU-GPU binding and adaptive routing (AR) tuning. A proper CPU-GPU binding on a multi-non-uniform memory access (NUMA) system, especially on the NDm A100 v4 nodes, is critical for all-to-all performance. Unfortunately, existing machine learning frameworks have not provided an efficient all-to-all communication library, resulting in performance regression for large-scale distributed training. Tutel optimizes the binding automatically and provides an elegant interface for user fine-tuning. Furthermore, Tutel leverages multipath technology, namely AR, on NDm A100 v4 clusters. For the all-to-all communication in MoE, the total data traffic size of the communication for each GPU doesn\u2019t change, but the data size between each GPU pair becomes smaller with the increasing number of GPUs. Smaller data sizes incur larger overheads in the all-to-all communication, leading to poorer MoE training performance. By taking advantage of AR technology available on NDm A100 v4 clusters, Tutel improves communication efficiency for groups of small messages and provides high-performance all-to-all communication on NDm A100 v4 systems. 
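The shrinking per-pair message size is simple arithmetic, sketched below with illustrative numbers (not taken from the blog's experiments): each GPU's all-to-all payload is split evenly across all peers, so the per-pair chunk decays as 1/num_gpus even though total traffic per GPU is constant.

```python
def all_to_all_pair_bytes(tokens_per_gpu, model_dim, bytes_per_elem, num_gpus):
    """Bytes exchanged between one GPU pair in an MoE all-to-all:
    the GPU's full payload divided evenly across all peers."""
    total = tokens_per_gpu * model_dim * bytes_per_elem  # constant per GPU
    return total // num_gpus

# Illustrative: 32,768 tokens x 2,048 dims in fp16 is 128 MiB per GPU at
# any scale, but the per-pair chunk drops from 16 MiB at 8 GPUs to
# 256 KiB at 512 GPUs -- small messages that AR handles more efficiently.
at_8 = all_to_all_pair_bytes(32768, 2048, 2, 8)
at_512 = all_to_all_pair_bytes(32768, 2048, 2, 512)
```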
Benefiting from CPU-GPU binding and AR tuning, Tutel achieves a 2.56x to 5.93x all-to-all speedup with 512 A100 GPUs for message sizes hundreds of MiB large, which are typically used in MoE training, as illustrated in Figure 2.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"385\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/TutelFairseqBlog_Updated-fig1-1024x385.png\" alt=\"On the left is a line graph comparing the single MoE layer training performance of Tutel, represented by a blue line, and fairseq, represented by a light gray line. The x-axis is the number of A100 (80GB) GPUs, beginning with 8 GPUs and going up to 512 GPUs, and the y-axis is the throughput (M tokens\/s), beginning with 0 and increasing in intervals of 10 up to 90. The throughput increases at a faster rate for Tutel than fairseq as the number of GPUs increases. On the right, a line graph with all-to-all message size, ranging from 1 MiB to 8 GiB, on the x-axis and bus bandwidth (GB\/s), starting at 0 and going up to 20 in increments of 5, on the y-axis. An orange line represents the original all-to-all performance in PyTorch, and a blue line represents the performance after CPU-GPU binding and AR optimization in Tutel. The blue line is always higher\/faster than the orange line, especially for message sizes hundreds of MB large.  
\" class=\"wp-image-798667\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/TutelFairseqBlog_Updated-fig1-1024x385.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/TutelFairseqBlog_Updated-fig1-300x113.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/TutelFairseqBlog_Updated-fig1-768x289.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/TutelFairseqBlog_Updated-fig1-240x90.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/TutelFairseqBlog_Updated-fig1.png 1393w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 1 (left): Compared to fairseq, for a single MoE layer, Tutel achieves an 8.49x speedup on an NDm A100 v4 node with 8 GPUs and a 2.75x speedup on 64 NDm A100 v4 nodes with 512 A100 GPUs. The detailed setting is as follows: batch_size = 32, sequence_length = 1,024, Top_K = 2, model_dim = 2,048, and hidden_size = 2,048. Figure 2 (right): The all-to-all bandwidth for different message sizes with 64 NDm A100 v4 nodes (512 A100 GPUs) before and after applying Tutel. 
Tutel achieves a 2.56x to 5.93x all-to-all speedup with 512 A100 GPUs for message sizes hundreds of MiB large.<\/figcaption><\/figure><\/div>\n\n\n\n<h3 id=\"diverse-and-flexible-moe-algorithms-support\">Diverse and flexible MoE algorithms support<\/h3>\n\n\n\n<p>Tutel provides diverse and flexible support for state-of-the-art MoE algorithms, including support for:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>the arbitrary K setting for the Top-K gating algorithm (most implementations only support Top-1 and Top-2)<\/li><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2101.03961\">different exploration strategies<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, including batch-prioritized routing, input dropout, and input jitter<\/li><li>different levels of precision, including half precision (FP16), full precision (FP32), and mixed precision (we\u2019ll support BF16 in our next release)<\/li><li>different types of devices, including both NVIDIA CUDA and AMD ROCm devices<\/li><\/ul>\n\n\n\n<p>Tutel will be actively integrating various emerging MoE algorithms from the open-source community.<\/p>\n\n\n\n<h2 id=\"integrating-tutel-with-metas-moe-language-model\">Integrating Tutel with Meta\u2019s MoE language model<\/h2>\n\n\n\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/pytorch\/fairseq\/tree\/moe\">Meta made its MoE language model open source<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and uses fairseq for its MoE implementation. We worked with Meta to integrate Tutel into the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/pytorch\/fairseq\">fairseq toolkit<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Meta has been using Tutel to train its large language model, which has an attention-based neural architecture similar to GPT-3, on Azure NDm A100 v4. We use Meta\u2019s language model to evaluate the end-to-end performance of Tutel. The model has 32 attention layers, each with 32 x 128-dimension heads. 
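As a back-of-the-envelope check on the 1.1 trillion\u2013parameter figure, the expert weights alone can be tallied from the Table 1 settings (a sketch that counts only the expert FFN weight matrices; attention, embedding, and bias parameters contribute comparatively little at this scale):

```python
# Expert FFN parameters implied by the Table 1 configuration.
embed_dim = 4096        # decoder-embed-dim
ffn_dim = 16384         # decoder-ffn-embed-dim
experts = 512           # moe-expert-count
moe_layers = 32 // 2    # 32 decoder layers, one MoE layer every 2 layers (moe-freq = 2)

# Each expert FFN has an up-projection and a down-projection matrix.
params_per_expert = 2 * embed_dim * ffn_dim
total_expert_params = params_per_expert * experts * moe_layers
# ~1.1e12, consistent with the stated 1.1 trillion parameters.
```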
Every two layers contain one MoE layer, and each GPU has one expert. Table 1 summarizes the detailed parameter setting of the model, and Figure 3 shows the 40 percent speedup Tutel achieves. As the number of GPUs increases, the gain from Tutel decreases from 131 percent with 8 A100 GPUs to 40 percent with 512 A100 GPUs because the all-to-all communication becomes the bottleneck. We\u2019ll do further optimization in the next version.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Configuration <\/td><td>Setting<\/td><td>Configuration <\/td><td>Setting<\/td><\/tr><tr><td>code branch<\/td><td><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/pytorch\/fairseq\/tree\/moe-benchmark\">moe-benchmark<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><td>Git commit ID<\/td><td>1ef1612<\/td><\/tr><tr><td>decoder-layers<\/td><td>32<\/td><td>Arch<\/td><td>transformer_lm_gpt<\/td><\/tr><tr><td>decoder-attention-heads<\/td><td>32<\/td><td>Criterion<\/td><td>moe_cross_entropy<\/td><\/tr><tr><td>decoder-embed-dim<\/td><td>4096<\/td><td>moe-freq<\/td><td>2<\/td><\/tr><tr><td>decoder-ffn-embed-dim<\/td><td>16384<\/td><td>moe-expert-count<\/td><td>512<\/td><\/tr><tr><td>tokens-per-sample<\/td><td>1024<\/td><td>moe-gating-use-fp32<\/td><td>True<\/td><\/tr><tr><td>Batch-size<\/td><td>24<\/td><td>Optimizer<\/td><td>Adam<\/td><\/tr><tr><td>vocabulary size <\/td><td>51200<\/td><td>fp16-adam-stats<\/td><td>True <\/td><\/tr><\/tbody><\/table><figcaption>Table 1: Configuration for MoE language model with 512 A100 (80G) GPUs<\/figcaption><\/figure>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logo-2-scaled.jpg\" alt=\"A line graph comparing the end-to-end performance of 
Meta\u2019s MoE language model using Azure NDm A100 v4 nodes with and without Tutel. The x-axis is the number of A100 (80GB) GPUs, beginning at 8 and going up to 512, and the y-axis is the throughput (K tokens\/s), beginning with 0 and going up to 1,000 in intervals of 100. Tutel always achieves higher throughput than fairseq. \"\/><figcaption>Figure 3: For end-to-end performance, Tutel achieves a more than 40 percent speedup with 64 NDm A100 v4 nodes for Meta\u2019s 1.1 trillion\u2013parameter MoE language model.<\/figcaption><\/figure><\/div>\n\n\n\n<h2 id=\"the-promise-of-moe\">The promise of MoE<\/h2>\n\n\n\n<p>MoE is a promising technology. It enables holistic training based on techniques from many areas, such as systematic routing and network balancing with massive nodes, and can even benefit from GPU-based acceleration. We demonstrate an efficient MoE implementation, Tutel, that resulted in significant gain over the fairseq framework. Tutel has been integrated into the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/\">DeepSpeed <\/a>framework, as well, and we believe that Tutel and related integrations will benefit Azure services, especially for those who want to scale their large models efficiently. 
As today\u2019s MoE is still in its early stages and more effort is needed to realize its full potential, Tutel will continue evolving and bringing more exciting results.<\/p>\n\n\n\n<h3 id=\"acknowledgment\">Acknowledgment<\/h3>\n\n\n\n<p>The research behind Tutel was conducted by a team of researchers from across Microsoft, including Wei Cui, Zilong Wang, Yifan Xiong, Guoshuai Zhao, Fan Yang, Peng Cheng, Yongqiang Xiong, Mao Yang, Lidong Zhou, Rafael Salas, Jithin Jose, Kushal Datta, Prabhat Ram, and Joe Chau.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Mixture of experts (MoE) is a deep learning model architecture in which computational cost is sublinear to the number of parameters, making scaling easier. Nowadays, MoE is the only approach demonstrated to scale deep learning models to trillion-plus parameters, paving the way for models capable of learning even more information and powering computer vision, speech [&hellip;]<\/p>\n","protected":false},"author":40735,"featured_media":798664,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-798607","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199560],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[]
,"related-groups":[],"related-projects":[678390],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Wei Cui","user_id":38859,"display_name":"Wei Cui","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/weicu\/\" aria-label=\"Visit the profile page for Wei Cui\">Wei Cui<\/a>","is_active":false,"last_first":"Cui, Wei","people_section":0,"alias":"weicu"},{"type":"user_nicename","value":"Peng Cheng","user_id":33225,"display_name":"Peng Cheng","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/pengc\/\" aria-label=\"Visit the profile page for Peng Cheng\">Peng Cheng<\/a>","is_active":false,"last_first":"Cheng, Peng","people_section":0,"alias":"pengc"},{"type":"guest","value":"rafael-salas","user_id":"798616","display_name":"Rafael  Salas","author_link":"Rafael  Salas","is_active":true,"last_first":"Salas, Rafael ","people_section":0,"alias":"rafael-salas"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-scaled-960x540.jpg\" class=\"img-object-cover\" alt=\"A line graph comparing the end-to-end performance of Meta\u2019s MoE language model using Azure NDm A100 v4 VMs with and without Tutel. The x-axis is the number of A100 (80GB) GPUs, beginning at 8 and going up to 512, and the y-axis is the throughput (K tokens\/s), beginning with 0 and going up to 1,000 in intervals of 100. 
Tutel always achieves higher throughput than fairseq.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-scaled-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-scaled-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Azure_tutel_Still_no_logov2-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/weicu\/\" title=\"Go to 
researcher profile for Wei Cui\" aria-label=\"Go to researcher profile for Wei Cui\" data-bi-type=\"byline author\" data-bi-cN=\"Wei Cui\">Wei Cui<\/a>, Yifan Xiong, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/pengc\/\" title=\"Go to researcher profile for Peng Cheng\" aria-label=\"Go to researcher profile for Peng Cheng\" data-bi-type=\"byline author\" data-bi-cN=\"Peng Cheng\">Peng Cheng<\/a>, and Rafael  Salas","formattedDate":"November 22, 2021","formattedExcerpt":"Mixture of experts (MoE) is a deep learning model architecture in which computational cost is sublinear to the number of parameters, making scaling easier. Nowadays, MoE is the only approach demonstrated to scale deep learning models to trillion-plus parameters, paving the way for models capable&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/798607","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/40735"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=798607"}],"version-history":[{"count":10,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/798607\/revisions"}],"predecessor-version":[{"id":798697,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/798607\/revisions\/798697"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/798664"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=798607"}],"wp:term":[{"taxonomy":"category","embedda
ble":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=798607"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=798607"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=798607"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=798607"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=798607"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=798607"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=798607"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=798607"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=798607"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=798607"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}