{"id":649749,"date":"2020-05-19T08:01:11","date_gmt":"2020-05-19T15:01:11","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=649749"},"modified":"2024-09-09T08:40:22","modified_gmt":"2024-09-09T15:40:22","slug":"ai-at-scale","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/ai-at-scale\/","title":{"rendered":"AI at Scale"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background- card-background--full-bleed\">\n\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"720\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/04\/Project-Turing-AI_header_1920x720.jpg\" class=\"attachment-full size-full\" alt=\"Project Turing header: electric pulse on black background\" style=\"object-position: 53% 52%\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/04\/Project-Turing-AI_header_1920x720.jpg 1920w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/04\/Project-Turing-AI_header_1920x720-300x113.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/04\/Project-Turing-AI_header_1920x720-1024x384.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/04\/Project-Turing-AI_header_1920x720-768x288.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/04\/Project-Turing-AI_header_1920x720-1536x576.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/04\/Project-Turing-AI_header_1920x720-1600x600.jpg 1600w\" sizes=\"auto, (max-width: 1920px) 100vw, 1920px\" \/>\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 
align-self-center\">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading h2\" id=\"ai-at-scale\">AI at Scale<\/h1>\n\n\n\n<p>Models, infrastructure and hardware for next-generation AI applications<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-does-ai-at-scale-matter\">Why does AI at scale matter?<\/h2>\n\n\n\n<p>Microsoft\u2019s AI at Scale initiative is pioneering a new approach that will result in next-generation AI capabilities that are scaled across the company\u2019s products and AI platforms. Building on years of systems work by Microsoft researchers, particularly in the area of <strong>parallel computation<\/strong>, AI at Scale makes it possible to <strong>quickly train machine learning models at an unprecedented scale<\/strong>. 
This includes developing&nbsp;<strong>a new class of large, centralized AI models<\/strong>&nbsp;that can be scaled and specialized across product domains, as well as creating&nbsp;<strong>state-of-the-art hardware and infrastructure<\/strong>&nbsp;to power this new class of models.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<h4 class=\"wp-block-heading\" id=\"onnx-integration\">ONNX Integration<\/h4>\n\n\n\n<p>AI at Scale capabilities, including DeepSpeed, have been integrated into the ONNX (Open Neural Network Exchange) runtime to add distributed training support for machine learning models that is framework-agnostic and hardware-agnostic.<\/p>\n\n\n\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/microsoft.github.io\/onnxruntime\/\" target=\"_blank\" rel=\"noopener noreferrer\">Get the ONNX code><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/onnxruntime-training-examples\" target=\"_blank\" rel=\"noopener noreferrer\">Explore training examples ><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<h4 class=\"wp-block-heading\" id=\"project-parasail\">Project Parasail<\/h4>\n\n\n\n<p>Pioneering a novel approach to parallelizing a large class of seemingly sequential applications, particularly stochastic gradient descent.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/parasail\/\">More on Project Parasail ><\/a><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<h4 class=\"wp-block-heading\" 
id=\"project-fiddle\">Project Fiddle<\/h4>\n\n\n\n<p>Pipeline parallelism is a novel approach to model training that overcomes the high communication costs of data parallelism and the hardware resource inefficiency of model parallelism.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/fiddle\/\">More on Project Fiddle ><\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/pipedream-a-more-effective-way-to-train-deep-neural-networks-using-pipeline-parallelism\/\">Read the blog ><\/a><\/p>\n<\/div>\n<\/div>\n\n\n\n<div style=\"padding-bottom:0; padding-top:0\" class=\"wp-block-msr-immersive-section alignfull row has-background has-blue-20-background-color has-text-color has-black-color wp-block-msr-immersive-section\">\n\t\n\t<div class=\"container\">\n\t\t<div class=\"wp-block-msr-immersive-section__wrapper\">\n\t\t\t<div class=\"wp-block-media-text has-vertical-margin-small  has-vertical-padding-none  is-stacked-on-mobile\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/preprod\/2021\/05\/1400x788_deepspeed_no_logo_still-4-1024x576.jpg\" alt=\"DeepSpeed multi GPU inference offers up to 6.9 times throughput improvement for large deep learning model inference. Progressive Layer Dropping offers 2.8 times faster convergence for large model training. 1-bit LAMB offers up to 4.6 times less communication overhead. Single GPU speedups for inference: 2.1 times on BERT Base, 4.4 times on BERT Large, 3.8 times on GPT 2, 3.5 times on GPT 2 XL, 1.9 times on GPT Neo. 
Multi GPU speedups for inference: 6.2 times for Turing NLG, 3.7 times for 175 billion parameter language model.\" class=\"wp-image-747997 size-full\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-1536x864.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-2048x1152.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-16x9.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_deepspeed_no_logo_still-4-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><div class=\"wp-block-media-text__content\">\n<h3 class=\"wp-block-heading\" 
id=\"deepspeed-for-large-model-training\">DeepSpeed for large model training<\/h3>\n\n\n\n<p>DeepSpeed is an open-source, PyTorch-compatible library that vastly improves large model training in scale, speed, cost, and usability, unlocking the ability to train models with over 100 billion parameters and enabling breakthroughs in areas such as natural language processing (NLP) and multi-modality (combining language with other types of data, such as images, video, and speech).<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-powers-8x-larger-moe-model-training-with-high-performance\/\">Learn more about the latest DeepSpeed updates ><\/a><\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-fill-github\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/github.com\/microsoft\/DeepSpeed\" target=\"_blank\" rel=\"noreferrer noopener\">Download DeepSpeed<\/a><\/div>\n<\/div>\n<\/div><\/div>\t\t<\/div>\n\t<\/div>\n\n\t<\/div>\n\n\n\n<div class=\"wp-block-columns are-vertically-aligned-top is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-top is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<h2 class=\"wp-block-heading\" id=\"advances-in-natural-language-processing\">Advances in natural language processing<\/h2>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-top is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<p>The Turing Natural Language Generation (T-NLG) is a 17-billion-parameter language model that outperforms the state-of-the-art on many downstream NLP tasks. 
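Training a model at T-NLG's scale relies on systems like the DeepSpeed library described above. The sketch below shows the general shape of a DeepSpeed ZeRO configuration; all values are illustrative placeholders, not the actual T-NLG training recipe.

```python
# Illustrative DeepSpeed-style configuration (hypothetical values, not
# the actual T-NLG recipe). ZeRO partitions optimizer states and
# gradients across data-parallel workers so very large models fit in memory.
ds_config = {
    "train_batch_size": 512,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {"stage": 2},  # partition optimizer states and gradients
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# With DeepSpeed installed, a PyTorch model is then wrapped roughly as:
#   engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
```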
In particular, T-NLG can enhance the Microsoft Office experience through writing assistance and question answering, paving the way for more fluent digital assistants.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/turing-nlg-a-17-billion-parameter-language-model-by-microsoft\/\" data-bi-cN=\"Turing-NLG: A 17-billion-parameter language model by Microsoft\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Turing-NLG: A 17-billion-parameter language model by Microsoft<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t\t\t<p class=\"annotations__caption text-neutral-400 mt-2\">February 2020<\/p>\n\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/blogs.bing.com\/search-quality-insights\/september-2020\/Introducing-the-next-wave-of-AI-at-Scale-innovations-in-Bing\" data-bi-cN=\"Introducing the next wave of AI at Scale innovations in Bing\" 
data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Introducing the next wave of AI at Scale innovations in Bing<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t\t\t<p class=\"annotations__caption text-neutral-400 mt-2\">September 2020<\/p>\n\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<p>On the multi-modality language-image front, we\u2019ve significantly outperformed the state-of-the-art on downstream language-image tasks (e.g. visual search) with Oscar (<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language\/\">Object-Semantics Aligned Pre-training<\/a>).<\/p>\n\n\n\n<p>Recently, pre-trained models such as&nbsp;<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/unicoder-a-universal-language-encoder-by-pre-training-with-multiple-cross-lingual-tasks\/\">Unicoder<\/a>,&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/google-research\/bert\/blob\/master\/multilingual.md\" target=\"_blank\" rel=\"noopener noreferrer\">M-BERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/facebookresearch\/XLM\" target=\"_blank\" rel=\"noopener noreferrer\">&nbsp;XLM<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;have been developed to learn multilingual representations for cross-lingual and multilingual tasks. 
By performing masked language model, translation language model, and other bilingual pre-training tasks on multilingual and bilingual corpora with shared vocabulary and weights for multiple languages, these models obtain surprisingly good cross-lingual capability. However, the community still lacks benchmark datasets to evaluate such capability.&nbsp;To help researchers further advance language-agnostic models and make AI systems more inclusive, the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/xglue-expanding-cross-lingual-understanding-and-generation-with-tasks-from-real-world-scenarios\/\">XGLUE<\/a> dataset helps researchers test a language model\u2019s zero-shot cross-lingual transfer capability \u2013 its ability to transfer what it learned in English to the same task in other languages.<\/p>\n\n\n\n<p>We are incorporating these breakthroughs into the company\u2019s products, including Bing, Office, Dynamics, and Xbox. Read this&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aka.ms\/AA87dvg\" target=\"_blank\" rel=\"noopener noreferrer\">blog post<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;to learn more.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-fill-github\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/github.com\/microsoft\/XGLUE\" target=\"_blank\" rel=\"noreferrer noopener\">Download XGLUE dataset<\/a><\/div>\n<\/div>\n\n\n\n<div style=\"height:5px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"new-hardware-for-deep-learning\">New hardware for deep learning<\/h2>\n\n\n\n<figure class=\"wp-block-image alignright size-large is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"574\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/brainwave-1024x574.jpg\" alt=\"Azure Accelerated Machine Learning with Project Brainwave\" class=\"wp-image-487142\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/brainwave-1024x574.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/brainwave-300x168.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/brainwave-768x431.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/brainwave-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/brainwave-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/brainwave.jpg 1263w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/project-brainwave\/\">Project Brainwave\u2019s<\/a> hardware boasts a soft Neural Processing Unit (NPU), based on a high-performance field-programmable gate array (FPGA), which accelerates deep neural network (DNN) inferencing, making it ideal for applications in computer vision and natural language processing. This approach is transforming computing by augmenting CPUs with an interconnected and configurable compute layer composed of programmable silicon.<\/p>\n\n\n\n<p>With a high-performance, precision-adaptable FPGA soft processor, Microsoft datacenters can serve pre-trained DNN models with high efficiencies at low batch sizes.<\/p>\n\n\n\n<p>Exploiting FPGAs on a datacenter-scale compute fabric, a single DNN model can be deployed as a scalable hardware microservice that leverages multiple FPGAs to create web-scale services. 
Such a microservice can process massive amounts of dynamic data.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Article<\/span>\n\t\t\t<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/how-to-deploy-fpga-web-service\" data-bi-cN=\"Deploy ML models to field-programmable gate arrays (FPGAs) with Azure Machine Learning\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Deploy ML models to field-programmable gate arrays (FPGAs) with Azure Machine Learning<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Article<\/span>\n\t\t\t<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/databox-online\/azure-stack-edge-overview\" data-bi-cN=\"What is Azure Stack Edge Pro FPGA?\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>What is Azure Stack Edge Pro FPGA?<\/span>&nbsp;<span class=\"glyph-in-link glyph-append 
glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"spell-correction-at-scale\">Spell correction at scale<\/h2>\n\n\n\n<p>Customers around the world use Microsoft products in over 100 languages, yet most do not come with high-quality spell correction. This limits customers\u2019 ability to search for information on the web and in the enterprise, and even to author content. With AI at Scale, we used deep learning along with language families to solve this problem for customers, building what we believe is the most comprehensive spelling correction system ever in terms of language coverage and accuracy.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/speller100-zero-shot-spelling-correction-at-scale-for-100-plus-languages\/\" data-bi-cN=\"Speller100: Zero-shot spelling correction at scale for 100-plus languages\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Speller100: Zero-shot spelling correction at scale for 100-plus languages<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t\t\t<p class=\"annotations__caption 
text-neutral-400 mt-2\">February 2021<\/p>\n\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\"><\/div>\n<\/div>\n<\/div>\n\n\n\n<div style=\"padding-bottom:0; padding-top:0\" class=\"wp-block-msr-immersive-section alignfull row has-background has-light-gray-background-color has-text-color has-black-color wp-block-msr-immersive-section\">\n\t\n\t<div class=\"container\">\n\t\t<div class=\"wp-block-msr-immersive-section__wrapper col-lg-11 col-xl-9 px-0 m-auto\">\n\t\t\t<div class=\"wp-block-media-text has-vertical-margin-small  has-vertical-padding-none  is-stacked-on-mobile\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/preprod\/2018\/10\/DeepSolvingWithSPACER_AlanTuring_AI_Site_1400x788-1024x576.png\" alt=\"a close up of a Turing keyboard\" class=\"wp-image-508793 size-full\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/DeepSolvingWithSPACER_AlanTuring_AI_Site_1400x788-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/DeepSolvingWithSPACER_AlanTuring_AI_Site_1400x788-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/DeepSolvingWithSPACER_AlanTuring_AI_Site_1400x788-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/DeepSolvingWithSPACER_AlanTuring_AI_Site_1400x788-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/DeepSolvingWithSPACER_AlanTuring_AI_Site_1400x788-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/DeepSolvingWithSPACER_AlanTuring_AI_Site_1400x788-343x193.png 343w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/DeepSolvingWithSPACER_AlanTuring_AI_Site_1400x788.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><div class=\"wp-block-media-text__content\">\n<h2 class=\"wp-block-heading\" id=\"learn-more\">Learn more<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/innovation.microsoft.com\/en-us\/ai-at-scale\" target=\"_blank\" rel=\"noopener noreferrer\">Learn more about the AI at Scale initiative<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aka.ms\/AIS-DeepDive\" target=\"_blank\" rel=\"noopener noreferrer\">Read this technology deep dive<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/forms.microsoft.com\/Pages\/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR7r1NVU5s6lGuA81eKEudjNUMFRLUFJOQUVHUDlORkZYQkg1RDVDOFQ2TS4u\" target=\"_blank\" rel=\"noopener noreferrer\">Nominate your organization for a private preview of Semantic Search by Project Turing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.msturing.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Visit the Microsoft Project Turing site<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n<\/div><\/div>\t\t<\/div>\n\t<\/div>\n\n\t<\/div>\n\n\n\n\n\t<div class=\"wp-block-msr-block-journey journey journey--date alignwide\" data-bi-aN=\"block-journey\">\n\t\t<ol class=\"journey__list\">\n\t\t\t\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div 
class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2021\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tSep\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"microsoft-turing-universal-language-representation-model-t-ulrv5-tops-xtreme-leaderboard-and-trains-100x-faster\">Microsoft Turing Universal Language Representation model, T-ULRv5, tops XTREME leaderboard and trains 100x faster<\/h3>\n\n\n\n<p>Our latest Turing universal language representation model, T-ULRv5, is once again the state of the art and <strong><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/sites.research.google\/xtreme\" target=\"_blank\" rel=\"noopener noreferrer\">at the top of the Google XTREME public leaderboard<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/strong>. Resulting from a collaboration between the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/turing.microsoft.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Microsoft Turing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> team and Microsoft Research, the 2.2-billion-parameter T-ULRv5 XL outperforms the second-best model by an average of 1.7 points. It is also the state of the art across each of the four subcategories of tasks on the leaderboard. 
These results demonstrate the strong capabilities of T-ULRv5, which, in addition to being more capable, trains 100 times faster than its predecessors.<\/p>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2021\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tAug\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"make-every-feature-binary-meb\">Make Every feature Binary (MEB)<\/h3>\n\n\n\n<p>The Microsoft Bing team developed and operationalized \u201cMake Every feature Binary\u201d (MEB), a large-scale sparse model that complements production Transformer models to improve search relevance. 
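To make the idea of a sparse binary feature space concrete, here is a minimal, hypothetical sketch (not Bing's actual MEB pipeline): query-term and document-term pairs are hashed into indices of a huge binary feature vector, and a sparse linear model scores a query-document pair by summing the learned weights at the active indices.

```python
import hashlib

# Illustrative size only; MEB's feature space has over 200 billion features.
NUM_FEATURES = 2 ** 28

def binary_features(query: str, doc_terms: list[str]) -> set[int]:
    """Hash (query term, document term) pairs into the active indices of a
    sparse binary feature vector. Hypothetical sketch, not the MEB pipeline."""
    active = set()
    for q in query.lower().split():
        for d in doc_terms:
            digest = hashlib.sha1(f"{q}|{d}".encode()).hexdigest()
            active.add(int(digest, 16) % NUM_FEATURES)
    return active

def score(active: set[int], weights: dict[int, float]) -> float:
    """A sparse linear model: sum the weights of the active binary features."""
    return sum(weights.get(i, 0.0) for i in active)
```

In a real system, the weight table would hold the model's billions of parameters; here it is just a dictionary, which is enough to show why only the active indices ever need to be touched at scoring time.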
MEB is a 135-billion-parameter model that harnesses the power of large-scale data, allowing an input feature space of over 200 billion binary features that capture the subtle relationships between search queries and documents, making search more accurate and dynamic.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/make-every-feature-binary-a-135b-parameter-sparse-neural-network-for-massively-improved-search-relevance\/\" data-bi-cN=\"Make Every feature Binary: A 135B parameter sparse neural network for massively improved search relevance\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Make Every feature Binary: A 135B parameter sparse neural network for massively improved search relevance<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\"><\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment \" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div 
class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"deepspeed-powers-8x-larger-moe-model-training-with-high-performance\">DeepSpeed powers 8x larger MoE model training with high performance<\/h3>\n\n\n\n<p>DeepSpeed continues to innovate, enabling mixture-of-experts (MoE) model training at larger scale, with fewer resources, excellent throughput, and near-linear scalability. It combines multidimensional parallelism and heterogeneous memory technologies to support massively large MoE models. Model scientists at Microsoft use DeepSpeed MoE to train <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/a-holistic-representation-toward-integrative-ai\/\">Z-code<\/a> MoE, a production-quality, multilingual, multi-task language model with 10 billion parameters, achieving state-of-the-art results on machine translation and cross-lingual summarization tasks.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-powers-8x-larger-moe-model-training-with-high-performance\/\" data-bi-cN=\"DeepSpeed powers 8x larger MoE model training with high performance\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>DeepSpeed powers 8x larger MoE model training with 
high performance<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/scalable-and-efficient-moe-training-for-multitask-multilingual-models\/\" data-bi-cN=\"Scalable and Efficient MoE Training for Multitask Multilingual Models\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Scalable and Efficient MoE Training for Multitask Multilingual Models<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Tutorial<\/span>\n\t\t\t<a href=\"https:\/\/www.deepspeed.ai\/tutorials\/mixture-of-experts\" data-bi-cN=\"Mixture of Experts\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Mixture of Experts<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div 
class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2021\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tMay\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"deepspeed-3-times-faster-5-times-cheaper-inference-for-large-dl-models\">DeepSpeed: 3 times faster, 5 times cheaper inference for large DL models<\/h3>\n\n\n\n<p>The DeepSpeed library has improved scale, speed, cost, and usability for large model training by orders of magnitude. Now, the team introduces DeepSpeed Inference \u2014 with high-performance multi-GPU inference and mixture of quantization \u2014 to significantly reduce the latency and cost of serving large DL models. 
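<\/p>

<p>As a rough illustration of why quantization cuts serving cost (a minimal plain-Python sketch, not the DeepSpeed implementation), weights can be mapped to 8-bit integers with a single per-tensor scale, shrinking the memory and bandwidth needed at inference time:<\/p>

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # guard against an all-zero tensor
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Recover approximate float values from the integer representation."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each recovered weight is within half a quantization step of the original.
assert all(abs(a - w) <= scale / 2 + 1e-12 for a, w in zip(approx, weights))
```

<p>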
The team also announces a suite of new features for compressed training, such as progressive layer dropping and 1-bit LAMB, to achieve fast and accurate training at low cost.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression\/\" data-bi-cN=\"DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\"><\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div 
class=\"moment__date-year\">\n\t\t\t\t\t2021\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tApr\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"zero-infinity\">ZeRO-Infinity<\/h3>\n\n\n\n<figure class=\"wp-block-image alignright size-large is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1024x576.jpg\" alt=\"DeepSpeed figure\" class=\"wp-image-741007\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-16x9.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-655x368.jpg 655w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The DeepSpeed Team releases ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring. At the same time it achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters. It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible. 
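<\/p>

<p>The partitioning idea at the heart of the ZeRO family can be sketched in a few lines of plain Python (an illustration of the concept, not the actual DeepSpeed code): each of N workers owns only 1\/N of the training state, applies updates to its own shard, and the full parameter set is reassembled afterwards.<\/p>

```python
def partition(values, num_workers):
    """Split a flat list of values into one contiguous shard per worker."""
    shard = (len(values) + num_workers - 1) // num_workers
    return [values[i * shard:(i + 1) * shard] for i in range(num_workers)]

def local_step(param_shard, grad_shard, lr=0.5):
    """Each worker applies an SGD update only to the shard it owns."""
    return [p - lr * g for p, g in zip(param_shard, grad_shard)]

def allgather(shards):
    """Reassemble the full parameter vector from all workers' shards."""
    return [p for shard in shards for p in shard]

params = [1.0, 2.0, 3.0, 4.0]
grads = [0.5, 0.5, 1.0, 1.0]
param_shards = partition(params, num_workers=2)  # each worker holds half the state
grad_shards = partition(grads, num_workers=2)
updated = allgather([local_step(p, g) for p, g in zip(param_shards, grad_shards)])
assert updated == [0.75, 1.75, 2.5, 3.5]
```

<p>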
In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak), while also demonstrating superlinear scalability.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training\/\" data-bi-cN=\"ZeRO-Infinity and DeepSpeed: Unlocking unprecedented model scale for deep learning training\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>ZeRO-Infinity and DeepSpeed: Unlocking unprecedented model scale for deep learning training<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Tool<\/span>\n\t\t\t<a href=\"https:\/\/github.com\/microsoft\/DeepSpeed\" data-bi-cN=\"DeepSpeed\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold 
text-decoration-none\"><span>DeepSpeed<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-infinity-breaking-the-gpu-memory-wall-for-extreme-scale-deep-learning\/\" data-bi-cN=\"ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2021\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tMar\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" 
id=\"the-best-of-ai-at-scale-semantic-search-capabilities-available-to-azure-customers-in-preview\">The best of AI at Scale: Semantic search capabilities available to Azure customers in preview<\/h3>\n\n\n\n<p>Microsoft Bing partnered with Azure Cognitive Search to make state-of-the-art search AI available to Azure customers through semantic search. Semantic search enables modern search experiences such as semantic ranking, extractive summarization, and machine reading comprehension. These features were built through the application of Microsoft Research technology and advancements\u2014including UniLM, Multi-Task Deep Neural Networks, MiniLM, and utilizing graph attention networks for machine reading comprehension\u2014to search scenarios. Deep neural network transfer learning allows the models to run well in Azure. An online A\/B experiment in which semantic search was enabled for Microsoft Docs produced a significant 4.5 percent clickthrough rate increase on challenging queries (three or more words)\u2014the largest relevance improvement the Microsoft Docs team has seen.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/the-science-behind-semantic-search-how-ai-from-bing-is-powering-azure-cognitive-search\/\" data-bi-cN=\"The science behind semantic search: How AI from Bing is powering Azure Cognitive Search\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold 
text-decoration-none\"><span>The science behind semantic search: How AI from Bing is powering Azure Cognitive Search<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\"><\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2021\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tFeb\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"spell-correction-at-scale\">Spell correction at scale<\/h3>\n\n\n\n<p>Customers around the world use Microsoft products in over 100 languages, yet most do not come with high-quality spell correction. This prevents customers from maximizing their ability to search for information on the web and enterprise\u2014and even to author content. 
With AI at Scale, we used deep learning along with language families to solve this problem for customers by building what we believe is the most comprehensive spelling correction system ever, in terms of both language coverage and accuracy.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/speller100-zero-shot-spelling-correction-at-scale-for-100-plus-languages\/\" data-bi-cN=\"Speller100: Zero-shot spelling correction at scale for 100-plus languages\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Speller100: Zero-shot spelling correction at scale for 100-plus languages<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2021\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tJan\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"deberta\">DeBERTa<\/h3>\n\n\n\n<p>Updates to the Transformer-based DeBERTa neural language model boost its performance, 
<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark\/\">topping the SuperGLUE and GLUE benchmark leaderboards<\/a>. The updates include training a larger version of the model that contains 48 Transformer layers and 1.5 billion parameters. The updated single version of DeBERTa surpasses human performance on the SuperGLUE benchmark for the first time based on macro-average score, while the ensemble model outperforms the single version to top both leaderboards. DeBERTa is being incorporated into the next iteration of the Microsoft Turing natural language representation model, Turing NLRv4.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark\/\" data-bi-cN=\"Microsoft DeBERTa surpasses human performance on the SuperGLUE benchmark\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Microsoft DeBERTa surpasses human performance on the SuperGLUE benchmark<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card 
depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Tool<\/span>\n\t\t\t<a href=\"https:\/\/github.com\/microsoft\/DeBERTa\" data-bi-cN=\"DeBERTa\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>DeBERTa<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/deberta-decoding-enhanced-bert-with-disentangled-attention-2\/\" data-bi-cN=\"DeBERTa: Decoding-Enhanced BERT with Disentangled Attention\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>DeBERTa: Decoding-Enhanced BERT with Disentangled Attention<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment \" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div 
class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"vinvl\">VinVL<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"309\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/VinVLBlogFigure1-1024x309.jpg\" alt=\"VinVL graphic\" class=\"wp-image-717265\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/VinVLBlogFigure1-1024x309.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/VinVLBlogFigure1-300x90.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/VinVLBlogFigure1-768x231.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/VinVLBlogFigure1-16x5.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/VinVLBlogFigure1.jpg 1251w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Researchers from Microsoft have developed a new object-attribute detection model for image encoding, dubbed VinVL (Visual features in Vision-Language), and performed a comprehensive empirical study to show that visual features matter significantly in VL models.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/vinvl-advancing-the-state-of-the-art-for-vision-language-models\/\" data-bi-cN=\"VinVL: Advancing the state of the art for vision-language models\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>VinVL: Advancing the state of the art for vision-language models<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/vinvl-making-visual-representations-matter-in-vision-language-models\/\" data-bi-cN=\"VinVL: Revisiting Visual Representations in Vision-Language Models\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>VinVL: Revisiting Visual Representations in Vision-Language Models<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div 
class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2020\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tDec\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"project-brainwave-microsoft-floating-point\">Project Brainwave + Microsoft Floating Point<\/h3>\n\n\n\n<p>Microsoft researchers and engineers <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/a-microsoft-custom-data-type-for-efficient-inference\/\">announced Microsoft Floating Point<\/a>, a data type that brings together the efficiency of integer data types with accuracy comparable to floating point. It\u2019s being used in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/project-brainwave\/\">Project Brainwave<\/a> architecture to power real-time production-scale deep neural network inference in the cloud, and it enables features in many Microsoft products, including Office 365 and Bing. 
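<\/p>

<p>The core idea of pairing integer efficiency with floating-point-like accuracy can be illustrated by a shared-exponent (\u201cblock floating point\u201d) sketch in plain Python; this is a simplified illustration, not the Microsoft Floating Point specification itself. A block of values stores small integer mantissas plus one exponent shared by the whole block:<\/p>

```python
import math

MANTISSA_BITS = 8

def encode_block(values):
    """Encode a block of floats as integer mantissas plus one shared exponent."""
    largest = max(abs(v) for v in values)
    # Choose the shared exponent so the largest value fills the mantissa range.
    exponent = math.frexp(largest)[1]  # largest == m * 2**exponent, 0.5 <= |m| < 1
    scale = 2.0 ** (exponent - MANTISSA_BITS)
    return [round(v / scale) for v in values], exponent

def decode_block(mantissas, exponent):
    """Decode: every element in the block reuses the same shared exponent."""
    scale = 2.0 ** (exponent - MANTISSA_BITS)
    return [m * scale for m in mantissas]

block = [0.72, -0.31, 0.05, 0.0009]
mantissas, exponent = encode_block(block)
decoded = decode_block(mantissas, exponent)
step = 2.0 ** (exponent - MANTISSA_BITS)
# Every value is recovered to within half a quantization step.
assert all(abs(d - v) <= step / 2 for d, v in zip(decoded, block))
```

<p>Because arithmetic then operates on the small integer mantissas, the hardware cost approaches that of integer math, while the shared exponent preserves dynamic range across the block.<\/p>

<p>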
Project Brainwave architecture\u2014to be turbocharged by silicon-hardened Microsoft Floating Point\u2014is expected to have a pivotal role in the future of algorithm codesign in hardware.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/a-microsoft-custom-data-type-for-efficient-inference\/\" data-bi-cN=\"A Microsoft custom data type for efficient inference\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>A Microsoft custom data type for efficient inference<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point\/\" data-bi-cN=\"Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with 
Microsoft Floating Point\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2020\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tNov\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"sc-gpt-model-using-few-shot-natural-language-generation-for-task-oriented-dialog\">SC-GPT model: Using few-shot natural language generation for task-oriented dialog<\/h3>\n\n\n\n<p>In this work, Microsoft researchers set out to improve generalization with limited labelled data for natural language generation (NLG) models. To do so, they developed the first NLG benchmark to simulate few-shot learning in task-oriented dialog systems. 
They also created the SC-GPT model, a multi-layer Transformer neural language model, which generates semantically controlled responses conditioned on a given semantic form and requires far fewer domain labels to generalize to new domains.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/few-shot-natural-language-generation-for-task-oriented-dialog\/\" data-bi-cN=\"Few-shot Natural Language Generation for Task-Oriented Dialog\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Few-shot Natural Language Generation for Task-Oriented Dialog<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2020\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tOct\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"turing-universal-language-representation-model-takes-top-spot-on-xtreme-leaderboard\">Turing Universal Language Representation model takes top spot on 
XTREME leaderboard<\/h3>\n\n\n\n<p>TULRv2, the Turing Universal Language Representation model for cross-lingual generalization, tops the Google XTREME leaderboard. The model uses another recent Microsoft innovation, InfoXLM, to create a universal model that represents 94 languages in the same vector space, and it is being used to power features in Microsoft Word, Outlook, and Teams. The XTREME leaderboard covers 40 languages spanning 12 language families, and it challenges models to reason about syntax and semantics at varying levels.<\/p>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2020\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tSep\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"bing-announces-updates-that-make-use-of-turing-natural-language-generation-and-expanded-use-of-turing-natural-language-representation\">Bing announces updates that make use of Turing Natural Language Generation and expanded use of Turing Natural Language Representation<\/h3>\n\n\n\n<p>Bing announces <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/blogs.bing.com\/search-quality-insights\/september-2020\/Introducing-the-next-wave-of-AI-at-Scale-innovations-in-Bing\" target=\"_blank\" rel=\"noopener noreferrer\">new updates for search utilizing Microsoft Turing capabilities<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
These include improvements to Autosuggest and the \u201cPeople Also Ask\u201d (PAA) feature, an expansion of cross-lingual intelligent answers to over 100 languages and 200 regions, and semantic highlighting of captions to better surface answers in search.<\/p>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment \" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"updates-and-optimizations-for-the-deepspeed-library-announced\">Updates and optimizations for the DeepSpeed library announced<\/h3>\n\n\n\n<p>Throughout the year, Microsoft researchers made significant updates to the DeepSpeed deep learning training optimization library. In this release, they introduced <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/\">3D parallelism, ZeRO-Offload, DeepSpeed Sparse Attention, and 1-bit Adam<\/a>. 
These updates, among other advances, allowed more people to use DeepSpeed with fewer resources, improved its efficiency across compute, memory, and communication, and enabled the training of models with up to 1 trillion parameters.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Event<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/video\/zero-fastest-bert-increasing-the-scale-and-speed-of-deep-learning-training-in-deepspeed\" data-bi-cN=\"ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed webinar\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed webinar<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\"><\/div>\n<\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div 
class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2020\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tJun\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"deberta-decoding-enhanced-bert-with-disentangled-attention\">DeBERTa: Decoding enhanced BERT with disentangled attention<\/h3>\n\n\n\n<p>Microsoft researchers created DeBERTa (Decoding enhanced BERT with disentangled attention), a Transformer-based neural language model that makes two changes to BERT. First, it uses disentangled self-attention, in which each word is represented by two vectors encoding its content and its position, and attention weights are computed from both content and relative position. Second, it enhances the output layer of BERT for pretraining, replacing the output softmax layer with an enhanced mask decoder (EMD) to predict the masked tokens.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/deberta-decoding-enhanced-bert-with-disentangled-attention-2\/\" data-bi-cN=\"DeBERTa: Decoding-Enhanced BERT with Disentangled Attention\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" 
class=\"annotations__link font-weight-semibold text-decoration-none\"><span>DeBERTa: Decoding-Enhanced BERT with Disentangled Attention<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Tool<\/span>\n\t\t\t<a href=\"https:\/\/github.com\/microsoft\/DeBERTa\" data-bi-cN=\"DeBERTa\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>DeBERTa<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment \" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"microsoft-researchers-release-xglue-a-benchmark-dataset-for-cross-lingual-transfer-learning-in-language-models\">Microsoft researchers release XGLUE, a benchmark dataset for cross-lingual transfer learning in language models<\/h3>\n\n\n\n<p>To test language models\u2019 ability to perform zero-shot cross-lingual transfer, Microsoft 
researchers announce the release of the XGLUE benchmark dataset. The dataset comprises 11 downstream tasks covering 19 languages, including Italian, Portuguese, Swahili, and Urdu, with training data available only in English. The tasks cover cross-lingual natural language understanding and generation, as well as tests unique to creating and evaluating search engine and news site scenarios.<\/p>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2020\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tMay\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"turing-natural-language-representation-updates-for-bing-announced\">Turing Natural Language Representation updates for Bing announced<\/h3>\n\n\n\n<p>Microsoft Bing shares how <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/blogs.bing.com\/search\/2020_05\/AI-at-Scale-in-Bing\/\" target=\"_blank\" rel=\"noopener noreferrer\">Turing language model capabilities are powering features in the search engine<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
Features include synthesizing a simple \u201cyes\u201d or \u201cno\u201d response to applicable search queries, using a zero-shot approach in which a model fine-tuned only on English transfers to 100 different languages that had pretrained models, and improving query intent understanding.<\/p>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment \" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"deepspeed-team-announce-zero-2-and-optimizations-that-set-the-fastest-bert-training-record-at-the-time\">DeepSpeed team announces ZeRO-2 and optimizations that set the fastest BERT training record at the time<\/h3>\n\n\n\n<figure class=\"wp-block-image alignright size-large is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"704\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Fig-2-updated-larger_DeepSpeed2-1024x704.png\" alt=\"Figure 2: ZeRO-2 scales to 170 billion parameters, has up to 10x higher throughput, obtains superlinear speedup, and improves usability by avoiding the need for code refactoring for models up to 13 billion parameters.\" class=\"wp-image-660462\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Fig-2-updated-larger_DeepSpeed2-1024x704.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Fig-2-updated-larger_DeepSpeed2-300x206.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Fig-2-updated-larger_DeepSpeed2-768x528.png 768w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Fig-2-updated-larger_DeepSpeed2-800x550.png 800w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Fig-2-updated-larger_DeepSpeed2.png 1203w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Improvements to the DeepSpeed library allowed Microsoft researchers to set the fastest BERT training record at the time: 44 minutes on 1,024 NVIDIA GPUs. To do this, they introduced kernel optimizations that boost single-GPU performance on models like BERT by more than 30%. These optimizations also allowed for better scaling of large models. ZeRO-2 reduced the memory footprint of gradients, activations, and fragmented memory, improving the scale and speed of deep learning training with DeepSpeed by an order of magnitude.<\/p>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2020\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tApr\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"microsoft-researchers-introduce-oscar-for-vision-and-language-pretraining\">Microsoft researchers introduce Oscar for vision and language pretraining<\/h3>\n\n\n\n<figure class=\"wp-block-image alignright size-large is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1024x576.png\" alt=\"Oscar object semantics graphic\" class=\"wp-image-659415\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1536x865.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-2048x1153.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1280x720.png 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1920x1080.png 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/Oscar\" target=\"_blank\" rel=\"noopener noreferrer\">Oscar 
(Object-Semantics Aligned Pretraining)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> arose from the observation that objects can serve as anchor points to make learning semantic alignments between images and text easier. By coupling images with language in a shared space, Oscar uses these object anchor points to align the semantics of image regions and words. The vision-and-language pretraining (VLP) framework achieved state-of-the-art performance on six well-established vision-and-language tasks.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/oscar-object-semantics-aligned-pre-training-for-vision-language-tasks\/\" data-bi-cN=\"Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment \" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" 
id=\"smart-using-principled-regularized-optimization-for-pre-trained-natural-language-model-fine-tuning\">SMART: Using principled regularized optimization for pre-trained natural language model fine-tuning<\/h3>\n\n\n\n<p>In transfer learning, fine-tuning pretrained natural language processing (NLP) models can cause a model to overfit the training data of downstream tasks and fail to generalize to unseen data. The SMART framework uses smoothness-inducing regularization to manage model complexity and Bregman proximal point optimization to prevent aggressive updating.<\/p>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/smart-robust-and-efficient-fine-tuning-for-pre-trained-natural-language-models-through-principled-regularized-optimization\/\" data-bi-cN=\"SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div 
class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\"><\/div>\n<\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment \" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"adversarial-training-for-large-neural-language-models\">Adversarial training for large neural language models<\/h3>\n\n\n\n<p>This work shared a comprehensive study of adversarial training at all stages of training for large neural language models: pretraining from scratch, continual pretraining of a well-trained model, and fine-tuning for specific tasks. The researchers also created ALUM, a general algorithm that regularizes training with embedding-space perturbations that maximize the adversarial loss, which obtained substantial gains over BERT on many NLP tasks. 
<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/adversarial-training-for-large-neural-language-models\/\" data-bi-cN=\"Adversarial Training for Large Neural Language Models\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Adversarial Training for Large Neural Language Models<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment \" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"effects-of-the-adaptive-learning-rate-on-stochastic-gradient-based-optimization\">Effects of the adaptive learning rate on stochastic gradient-based optimization<\/h3>\n\n\n\n<p>Microsoft researchers and collaborators looked more closely at how warmup should be conducted in stochastic gradient-based optimization. More specifically, they zeroed in on the variance issue in the adaptive learning rate and observed its root cause: the limited number of training samples used in the early stages of training causes undesirably large variance. 
They also presented a new variant of Adam, called RAdam, which corrects this variance problem and compares favorably with heuristic warmup.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/on-the-variance-of-the-adaptive-learning-rate-and-beyond\/\" data-bi-cN=\"On the Variance of the Adaptive Learning Rate and Beyond\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>On the Variance of the Adaptive Learning Rate and Beyond<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment has-date\" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t<div class=\"moment__date-year\">\n\t\t\t\t\t2020\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t<div class=\"moment__date-month\">\n\t\t\t\t\tFeb\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"microsoft-project-turing-team-announces-turing-nlg-language-model-clocking-in-at-17-billion-parameters\">Microsoft Project Turing team announces Turing-NLG language model, clocking in at 17 billion parameters<\/h3>\n\n\n\n<p>In February,&nbsp;<a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/turing-nlg-a-17-billion-parameter-language-model-by-microsoft\/\">the Turing Natural Language Generation (Turing-NLG) model<\/a>&nbsp;made waves as the largest language model at the time. To train the Transformer-based generative model, researchers used a novel model parallelism technique, courtesy of the Zero Redundancy Optimizer (ZeRO), and tensor slicing to shard the model across four NVIDIA V100 GPUs on the NVIDIA Megatron-LM framework. Among its capabilities, the team highlighted direct question answering, zero-shot question answering, and abstractive summarization with less supervision.<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls src=\"https:\/\/content.blubrry.com\/microsoftresearch\/msr_majumder_112.mp3\"><\/audio><\/figure>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/podcast\/microsofts-ai-transformation-project-turing-and-smarter-search-with-rangan-majumder\/\">Visit the podcast page with transcript ><\/a><\/p>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment \" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"microsoft-deepspeed-team-releases-open-source-library-including-zero-a-novel-zero-redundancy-optimizer\">Microsoft DeepSpeed team releases open-source library, including ZeRO, a novel zero redundancy optimizer<\/h3>\n\n\n\n<figure class=\"wp-block-image alignright size-large is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"636\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/09\/Blog_DeepSpeed3_MainHero_HighRes-1024x636.jpg\" alt=\"DeepSpeed graphs\" class=\"wp-image-690747\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/09\/Blog_DeepSpeed3_MainHero_HighRes-1024x636.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/09\/Blog_DeepSpeed3_MainHero_HighRes-300x186.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/09\/Blog_DeepSpeed3_MainHero_HighRes-768x477.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/09\/Blog_DeepSpeed3_MainHero_HighRes.jpg 1237w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>In conjunction with Turing-NLG, Microsoft researchers released the open-source&nbsp;<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/\">DeepSpeed<\/a>&nbsp;library, which improved large model training in four key areas: scale, speed, cost, and usability. 
Initially, the library included ZeRO-1, which decreased the resources needed for model and data parallelism while increasing the number of trainable parameters to as many as 100 billion.<\/p>\n\n\n\n<p>DeepSpeed, along with other distributed training tools, is being incorporated into the&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/onnxruntime\">ONNX (Open Neural Network Exchange) runtime<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, an open-source, high-performance tool for machine learning models.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-memory-optimizations-toward-training-trillion-parameter-models\/\" data-bi-cN=\"ZeRO: Memory Optimizations Toward Training Trillion Parameter Models\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>ZeRO: Memory Optimizations Toward Training Trillion Parameter Models<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card 
depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Project<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/\" data-bi-cN=\"DeepSpeed\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>DeepSpeed<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Tool<\/span>\n\t\t\t<a href=\"https:\/\/github.com\/microsoft\/onnxruntime\" data-bi-cN=\"ONNX Runtime\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>ONNX Runtime<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\n\t<li class=\"wp-block-msr-block-moment moment \" data-bi-aN=\"block-moment\">\n\t\t<div class=\"moment__dot moment__dot--start\" role=\"presentation\"><\/div>\n\t\t<div role=\"presentation\"><\/div>\n\t\t<div class=\"moment__details\">\n\t\t\t\t\t\t<div class=\"moment__counter\"><\/div>\n\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t<div class=\"moment__content\">\n\t\t\t\n\n<h3 class=\"wp-block-heading moment__title\" id=\"unilmv2-pseudo-masked-language-models-for-unified-language-model-pre-training\">UniLMv2: 
Pseudo-Masked Language Models for Unified Language Model Pre-Training<\/h3>\n\n\n\n<p>This paper introduced a new method for pretraining unified language models using a pseudo-masked language model (PSLM). This method can be used for both autoencoding and partially autoregressive language modeling tasks, and the results with UniLMv2, a model trained using PSLM, achieved state of the art on a number of natural language understanding and generation tasks across widely used benchmarks.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/unilmv2-pseudo-masked-language-models-for-unified-language-model-pre-training\/\" data-bi-cN=\"UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\t\t<\/div>\n\t\t<div class=\"moment__dot moment__dot--end\" role=\"presentation\"><\/div>\n\t<\/li>\n\t\n\t\t<\/ol>\n\t<\/div>\n\t\n\n","protected":false},"excerpt":{"rendered":"<p>AI at Scale is an applied research initiative that works to evolve Microsoft products with the adoption of deep learning for both natural language text and image processing. 
Our work is actively being integrated into Microsoft products, including Bing, Office, and Xbox.<\/p>\n","protected":false},"featured_media":649848,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-649749","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[640596,748009,740689,701698,671469,658413,658251,658242,655848,655842,655836,655830,655824,650868,649755,328361,626646,626103,619842,611088,596767,590413,587764,578926,574707,559176,508631,479904,476322,442938],"related-downloads":[],"related-videos":[739810],"related-groups":[144812],"related-events":[677526],"related-opportunities":[],"related-posts":[722737,895428,861387,811273,801178,791159,783490,778711,766675,764275,747265,729205,597991,717256,715399,712588,709066,698635,689370,664554,658659,657990,644958,635340,635250],"related-articles":[],"tab-content":[{"id":0,"name":"Timeline","content":"[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">February 2020<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>Microsoft Project Turing team announces Turing-NLG language model, clocking in at 17 billion parameters<\/h3>\r\nIn February, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/turing-nlg-a-17-billion-parameter-language-model-by-microsoft\/\">The Turing Natural Language Generation (Turing-NLG) model<\/a> made 
waves as the largest language model at the time. To train the Transformer-based generative model, researchers used a novel model parallelism technique, courtesy of the Zero Redundancy Optimizer (ZeRO), and tensor slicing to shard the model across four NVIDIA V100 GPUs on the NVIDIA Megatron-LM framework. Among its capabilities, the team highlighted direct question answering, zero-shot question answering, and abstractive summarization with less supervision.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/podcast\/microsofts-ai-transformation-project-turing-and-smarter-search-with-rangan-majumder\/\">Listen to the podcast &gt;<\/a>\r\n\r\nhttps:\/\/content.blubrry.com\/microsoftresearch\/msr_majumder_112.mp3\r\n<div style=\"height: 50px\"><\/div>\r\n<h3>Microsoft DeepSpeed team releases open-source library, including ZeRO, a novel zero redundancy optimizer<\/h3>\r\n<img class=\"alignright wp-image-690915\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/09\/DeepSpeed_3_FeatureImage_MCRSite-300x169.png\" alt=\"\" width=\"400\" height=\"225\" \/>In conjunction with Turing-NLG, Microsoft researchers released the open-source <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/\">DeepSpeed<\/a> library, which improved large model training in four key areas: scale, speed, cost, and usability. 
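The memory saving behind ZeRO can be illustrated with a small self-contained sketch. This is an illustration of the idea only, not the DeepSpeed implementation: the function names are ours, and the 12-bytes-per-parameter figure is the commonly cited cost of Adam optimizer state (fp32 momentum, variance, and master weight) in mixed-precision training.

```python
def partition(n_params, world_size):
    """Split parameter indices into near-equal contiguous shards, one per
    data-parallel rank. The ZeRO-1 idea: instead of every rank replicating
    all optimizer state, each rank keeps state only for its own shard."""
    base, extra = divmod(n_params, world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards

def optimizer_memory_per_rank(n_params, world_size, bytes_per_state=12):
    """Approximate per-rank optimizer memory under ZeRO-1-style
    partitioning: the total state is divided by the data-parallel
    degree instead of being replicated on every GPU."""
    return n_params * bytes_per_state // world_size
```

For a 100-billion-parameter model on 64 data-parallel ranks, this partitioning is what turns an infeasible per-GPU replica of optimizer state into a shard roughly 64 times smaller.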
Initially, the library included ZeRO-1, which decreased the resources needed for model and data parallelism while greatly increasing the number of trainable parameters, up to 100 billion.\r\n\r\nDeepSpeed, along with other distributed training tools, is being incorporated into the <a href=\"https:\/\/github.com\/microsoft\/onnxruntime\">ONNX (Open Neural Network Exchange) runtime<\/a>, an open-source, high-performance engine for machine learning models.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-memory-optimizations-toward-training-trillion-parameter-models\/\">Read the publication &gt;<\/a>\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/deepspeed\/\">Explore the DeepSpeed project &gt;<\/a>\r\n\r\n<a href=\"https:\/\/github.com\/microsoft\/onnxruntime\">Download ONNX Runtime &gt;<\/a>\r\n<div style=\"height: 50px\"><\/div>\r\n<h3>UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training<\/h3>\r\nThis paper introduced a new method for pretraining unified language models using a pseudo-masked language model (PSLM).
This method can be used for both autoencoding and partially autoregressive language modeling tasks, and the results with UniLMv2, a model trained using PSLM, achieved state of the art on a number of natural language understanding and generation tasks across widely used benchmarks.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/unilmv2-pseudo-masked-language-models-for-unified-language-model-pre-training\/\">Read the publication &gt;<\/a>\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">April 2020<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>Microsoft researchers introduce Oscar for vision and language pretraining<\/h3>\r\n<img class=\"wp-image-659415 alignright\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1024x576.png\" alt=\"\" width=\"400\" height=\"225\" \/><a href=\"https:\/\/github.com\/microsoft\/Oscar\">Oscar (Object-Semantics Aligned Pretraining)<\/a> arose from the observation that objects can be used as anchor points to make learning semantic alignments between images and texts easier. By coupling images with language in a shared space, Oscar allows objects to act as anchor points in aligning the semantics between images and words. 
The vision and language pretraining (VLP) framework achieved state-of-the-art performance on six well-established vision-and-language tasks.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/oscar-object-semantics-aligned-pre-training-for-vision-language-tasks\/\">Read the publication &gt;<\/a>\r\n<div style=\"height: 30px\"><\/div>\r\n<h3>SMART: Using principled regularized optimization for pre-trained natural language model fine-tuning<\/h3>\r\nIn transfer learning, fine-tuning pretrained natural language processing (NLP) models can cause the model to overfit the training data of downstream tasks and fail to generalize to unseen data. The SMART framework uses smoothness-inducing regularization to manage the complexity of the model, and Bregman proximal point optimization to prevent aggressive updating.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/smart-robust-and-efficient-fine-tuning-for-pre-trained-natural-language-models-through-principled-regularized-optimization\/\">Read the publication &gt;<\/a>\r\n<div style=\"height: 30px\"><\/div>\r\n<h3>Adversarial training for large neural language models<\/h3>\r\nThis work shared a comprehensive study of adversarial training in all stages of training for large neural language models: pretraining from scratch, continual pretraining on a well-trained model, and fine-tuning for specific tasks.
The researchers also created a general algorithm to maximize adversarial loss, called ALUM, which obtained substantial gains over BERT on many NLP tasks.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/adversarial-training-for-large-neural-language-models\/\">Read the publication &gt;<\/a>\r\n<div style=\"height: 30px\"><\/div>\r\n<h3>Effects of the adaptive learning rate on stochastic gradient-based optimization<\/h3>\r\nMicrosoft researchers and collaborators looked more closely at how warmup should be conducted in stochastic gradient-based optimization. More specifically, they zeroed in on the variance of the adaptive learning rate and observed its root cause: the limited number of training samples used in the early stages of training causes undesirably large variance. They also presented a new variant of Adam, called RAdam, which corrects this variance problem and compares well with heuristic warmup.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/on-the-variance-of-the-adaptive-learning-rate-and-beyond\/\">Read the publication &gt;<\/a>\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">May 2020<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>Turing Natural Language Representation updates for Bing announced<\/h3>\r\nMicrosoft Bing shares how <a href=\"https:\/\/blogs.bing.com\/search\/2020_05\/AI-at-Scale-in-Bing\/\" target=\"_blank\" rel=\"noopener\">Turing language model capabilities are powering features in the search engine<\/a>.
Features included synthesizing a simple \u201cyes\u201d or \u201cno\u201d response to applicable search queries, using a zero-shot approach to fine-tune only an English language model for translation into 100 different languages that had pretrained models, and improving query intent understanding.\r\n<div style=\"height: 30px\"><\/div>\r\n<h3>DeepSpeed team announces ZeRO-2 and optimizations that set the fastest BERT training record at the time<\/h3>\r\n<img class=\"wp-image-660447 alignright\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/fig-2-larger-update-_-deepspeed-300x206.png\" alt=\"Figure 2: ZeRO-2 scales to 170 billion parameters, has up to 10x higher throughput, obtains superlinear speedup, and improves usability by avoiding the need for code refactoring for models up to 13 billion parameters. \" width=\"400\" height=\"275\" \/>\r\n\r\nImprovements to the DeepSpeed library allowed Microsoft researchers to set what was then the fastest BERT training record: 44 minutes on 1,024 NVIDIA GPUs. To do this, they used kernel optimizations that boosted the single-GPU performance of models like BERT by more than 30%. These optimizations also allowed for better scaling of large models.
In ZeRO-2, memory footprints were reduced in gradients, activation memory, and fragmented memory to improve the scale and speed of deep learning training with DeepSpeed by an order of magnitude.\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">June 2020<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>DeBERTa: Decoding enhanced BERT with disentangled attention<\/h3>\r\nMicrosoft researchers created DeBERTa (Decoding enhanced BERT with disentangled attention), a Transformer-based neural language model that makes two changes to BERT. It uses disentangled attention for self-attention, with each word represented by two vectors that encode content and position. The attention weights for these words are then computed using their contents and relative positions. DeBERTa also enhances the output layer of BERT for pretraining, by replacing the output softmax layer in BERT with an enhanced masked decoder (EMD) to predict the masked tokens during pretraining.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/deberta-decoding-enhanced-bert-with-disentangled-attention-2\/\">Read the publication &gt;<\/a>\r\n<div style=\"height: 30px\"><\/div>\r\n<h3>Microsoft researchers release XGLUE, a benchmark dataset for cross-lingual transfer learning in language models<\/h3>\r\nTo test language models\u2019 ability to perform zero-shot cross-lingual transfer capability, Microsoft researchers announce the release of the XGLUE benchmark dataset. Using training data available only in English, the dataset comprises 11 downstream tasks covering 19 languages, including Italian, Portuguese, Swahili, and Urdu. 
The tasks cover cross-lingual natural language understanding and generation, as well as tasks constructed from search engine and news site scenarios.\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">September 2020<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>Bing announces updates that make use of Turing Natural Language Generation and expanded use of Turing Natural Language Representation<\/h3>\r\nBing announces <a href=\"https:\/\/blogs.bing.com\/search-quality-insights\/september-2020\/Introducing-the-next-wave-of-AI-at-Scale-innovations-in-Bing\" target=\"_blank\" rel=\"noopener\">new updates for search utilizing Microsoft Turing capabilities<\/a>. These included improvements for Autosuggest, the \u201cPeople Also Ask\u201d (PAA) feature, an expansion of cross-lingual intelligent answers to over 100 languages and 200 regions, and semantic highlighting for captions to better surface answers in search.\r\n<div style=\"height: 30px\"><\/div>\r\n<h3>Updates and optimizations for the DeepSpeed library announced<\/h3>\r\nMicrosoft researchers made significant updates to the DeepSpeed deep learning training optimization library throughout the year. In this release, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/\">researchers introduced 3D parallelism, ZeRO-Offload, DeepSpeed Sparse Attention, and 1-bit Adam<\/a>.
These updates, among other advances, allowed more people to use DeepSpeed with fewer resources and expanded the scope of its efficiency across compute, memory, and communication, enabling training of models up to 1 trillion parameters.\r\n\r\n<a href=\"https:\/\/note.microsoft.com\/MSR-Webinar-DeepSpeed-Registration-On-Demand.html\" target=\"_blank\" rel=\"noopener\">Watch the webinar &gt;<\/a>\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">October 2020<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>Turing Universal Language Representation model takes top spot on XTREME leaderboard<\/h3>\r\nTULRv2, the Turing Universal Language Representation model for cross-lingual generalization, tops the Google XTREME leaderboard. The model uses another recent Microsoft innovation, InfoXLM, to create a universal model that represents 94 languages in the same vector space, and it is being used to power features in Microsoft Word, Outlook, and Teams. 
The XTREME leaderboard covers 40 languages spanning 12 language families, and it challenges models to reason about syntax and semantics at varying levels.\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">November 2020<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>SC-GPT model: Using few-shot natural language generation for task-oriented dialog<\/h3>\r\nIn this work, Microsoft researchers set out to improve generalization with limited labelled data for natural language generation (NLG) models. To do so, they developed the first NLG benchmark to simulate few-shot learning in task-oriented dialog systems. They also created the SC-GPT model, a multi-layer Transformer neural language model, which generates semantically controlled responses conditioned on a given semantic form and requires much fewer domain labels to generalize to new domains.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/few-shot-natural-language-generation-for-task-oriented-dialog\/\">Read the publication &gt;<\/a>\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">December 2020<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>Project Brainwave + Microsoft Floating Point<\/h3>\r\nMicrosoft researchers and engineers <a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/a-microsoft-custom-data-type-for-efficient-inference\/\">announced Microsoft Floating Point<\/a>, a data type that brings together the efficiency of integer data types with accuracy comparable to floating point. It\u2019s being used in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/project-brainwave\/\">Project Brainwave<\/a> architecture to power real-time production-scale deep neural network inference in the cloud, and it enables features in many Microsoft products, including Office 365 and Bing. Project Brainwave architecture\u2014to be turbocharged by silicon-hardened Microsoft Floating Point\u2014is expected to have a pivotal role in the future of algorithm codesign in hardware.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point\/\">Read the publication &gt;<\/a>\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/a-microsoft-custom-data-type-for-efficient-inference\/?OCID=msr_blog_Brainwave_NeurIPS_project\">Read the blog &gt;<\/a>\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">January 2021<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>DeBERTa<\/h3>\r\nUpdates to the Transformer-based DeBERTa neural language model boost its performance, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark\/\">topping the SuperGLUE and GLUE benchmark leaderboards<\/a>. 
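The disentangled attention described in the June 2020 DeBERTa entry above can be sketched in a few lines. This is a toy illustration under simplifying assumptions (real DeBERTa applies learned projection matrices to produce separate content and position queries and keys; all function names here are ours): attention scores are the sum of content-to-content, content-to-position, and position-to-content terms over clipped relative positions, scaled by 1/sqrt(3d) as in the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def disentangled_scores(H, P, k):
    """Toy disentangled attention scores.

    H: list of n content vectors (one per token), each of dimension d.
    P: list of 2k+1 relative-position embeddings, also dimension d.
    Relative distances i-j are clipped to [-k, k] before indexing P.
    """
    n, d = len(H), len(H[0])

    def rel(i, j):  # clipped relative distance, shifted to a P index
        return max(-k, min(k, i - j)) + k

    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c2c = dot(H[i], H[j])          # content-to-content
            c2p = dot(H[i], P[rel(i, j)])  # content-to-position
            p2c = dot(P[rel(j, i)], H[j])  # position-to-content
            # Three score terms are summed, hence the sqrt(3d) scaling.
            A[i][j] = (c2c + c2p + p2c) / math.sqrt(3 * d)
    return A
```

The key design point the entry describes is visible here: content and position contribute through separate vectors, rather than being added into a single input embedding as in BERT.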
The updates include training a larger version of the model that contains 48 Transformer layers and 1.5 billion parameters. The updated single version of DeBERTa surpasses human performance on the SuperGLUE benchmark for the first time based on macro-average score, while the ensemble model outperforms the single version to top both leaderboards. DeBERTa is being incorporated into the next iteration of the Microsoft Turing natural language representation model, Turing NLRv4.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/deberta-decoding-enhanced-bert-with-disentangled-attention-2\/\">Read the publication &gt;<\/a>\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark\/?OCID=msr_blog_deberta_project\">Read the blog &gt;<\/a>\r\n\r\n<a href=\"https:\/\/github.com\/microsoft\/DeBERTa\" target=\"_blank\" rel=\"noopener\">DeBERTa on GitHub &gt;<\/a>\r\n<div style=\"height: 30px\"><\/div>\r\n<h3>VinVL<\/h3>\r\nResearchers from Microsoft have developed a new object-attribute detection model for image encoding, dubbed VinVL (Visual features in Vision-Language), and performed a comprehensive empirical study to show that visual features matter significantly in VL models. 
Learn more in this <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/vinvl-advancing-the-state-of-the-art-for-vision-language-models\/\">blog post<\/a>.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/vinvl-making-visual-representations-matter-in-vision-language-models\/\">Read the publication &gt;<\/a>\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">February 2021<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>Spell correction at scale<\/h3>\r\nCustomers around the world use Microsoft products in over 100 languages, yet most do not come with high-quality spell correction. This prevents customers from maximizing their ability to search for information on the web and enterprise\u2014and even to author content. With AI at Scale, we used deep learning along with language families to solve this problem for customers by building what we believe is the most comprehensive and accurate spelling correction system ever in terms of language coverage and accuracy. 
Learn more in this <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/speller100-zero-shot-spelling-correction-at-scale-for-100-plus-languages\/?OCID=msr_blog_Speller100_project\">blog post<\/a>.\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">March 2021<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>The best of AI at Scale: Semantic search capabilities available to Azure customers in preview<\/h3>\r\nMicrosoft Bing partnered with Azure Cognitive Search to make state-of-the-art search AI available to Azure customers through semantic search. Semantic search enables modern search experiences such as semantic ranking, extractive summarization, and machine reading comprehension. These features were built through the application of Microsoft Research technology and advancements\u2014including UniLM, Multi-Task Deep Neural Networks, MiniLM, and utilizing graph attention networks for machine reading comprehension\u2014to search scenarios. Deep neural network transfer learning allows the models to run well in Azure. An online A\/B experiment in which semantic search was enabled for Microsoft Docs produced a significant 4.5 percent clickthrough rate increase on challenging queries (three or more words)\u2014the largest relevance improvement the Microsoft Docs team has seen. 
Learn more in this <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/the-science-behind-semantic-search-how-ai-from-bing-is-powering-azure-cognitive-search\/\">blog post<\/a>.\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">April 2021<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>ZeRO-Infinity<\/h3>\r\n<img class=\"alignright wp-image-741007\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_deepspeed_update_figure_nologo_Still-2_04-2020-300x169.jpg\" alt=\"DeepSpeed figure\" width=\"400\" height=\"225\" \/>The DeepSpeed Team releases ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring. At the same time it achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters. It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible. 
In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak), while also demonstrating superlinear scalability.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-infinity-breaking-the-gpu-memory-wall-for-extreme-scale-deep-learning\/\">Read the publication &gt;<\/a>\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training\/\">Read the blog &gt;<\/a>\r\n\r\n<a href=\"https:\/\/github.com\/microsoft\/DeepSpeed\" target=\"_blank\" rel=\"noopener\">DeepSpeed on GitHub &gt;<\/a>\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">May 2021<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>DeepSpeed: 3 times faster, 5 times cheaper inference for large DL models<\/h3>\r\nThe DeepSpeed library has improved scale, speed, cost, and usability for large model training by orders of magnitude. Now, the team introduces DeepSpeed Inference \u2014 with high-performance multi-GPU inference and mixture of quantization \u2014 to significantly reduce the latency and cost of serving large DL models.
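To illustrate the kind of technique inference quantization builds on, here is a generic symmetric int8 scheme. This is not DeepSpeed's actual Mixture-of-Quantization code, and the function names are ours; it only shows why quantization cuts serving cost: weights shrink 4x versus fp32 at a bounded reconstruction error.

```python
def quantize_int8(xs):
    """Symmetric per-tensor int8 quantization: map floats onto integers
    in [-127, 127] using a single scale derived from the max magnitude."""
    amax = max(abs(x) for x in xs)
    scale = amax / 127 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]
```

Each dequantized value lands within one quantization step of the original, which is why low-precision serving can preserve model quality while shrinking memory traffic and latency.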
The team also announces a suite of new features for compressed training, such as progressive layer dropping and 1-bit LAMB, to achieve fast and accurate training at low cost.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression\">Read the blog &gt;<\/a>\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]\r\n\r\n[row][column class=\"m-col-6-24\"]\r\n<p style=\"text-align: right;margin-bottom: 0\"><img class=\"alignnone wp-image-712573\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/journey-calendar-icon.png\" alt=\"calendar icon\" width=\"25\" height=\"25\" \/><\/p>\r\n<p style=\"text-align: right\">August 2021<\/p>\r\n[\/column] [column class=\"m-col-18-24\"]\r\n<h3>Make Every feature Binary (MEB)<\/h3>\r\nThe Microsoft Bing team developed and operationalized \u201cMake Every feature Binary\u201d (MEB), a large-scale sparse model that complements production Transformer models to improve search relevance. To make search more accurate and dynamic, the team built MEB as a 135-billion-parameter model that harnesses the power of large data and supports an input feature space with over 200 billion binary features reflecting the subtle relationships between search queries and documents.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/make-every-feature-binary-a-135b-parameter-sparse-neural-network-for-massively-improved-search-relevance\">Read the blog &gt;<\/a>\r\n<div style=\"height: 30px\"><\/div>\r\n<h3>DeepSpeed powers 8x larger MoE model training with high performance<\/h3>\r\nDeepSpeed continues to innovate, enabling mixture-of-experts (MoE) model training with bigger size, fewer resources, excellent throughput, and near-linear scalability. It combines multidimensional parallelism and heterogeneous memory technologies to support massively large MoE models.
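The mixture-of-experts idea above can be sketched with a toy top-1 gate. This is a generic Switch-style router for illustration only (DeepSpeed MoE's production routing, load balancing, and expert parallelism are far more elaborate; all names here are ours): each token activates just one expert, which is why parameter count can grow without growing per-token compute.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def top1_gate(logits):
    """Pick the single expert with the highest gate probability
    for one token (top-1 routing); ties go to the lowest index."""
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def route(tokens_logits):
    """Group token indices by their chosen expert. Only that expert's
    parameters run in each token's forward pass, so total parameters
    scale with the number of experts while per-token FLOPs stay flat."""
    assignment = {}
    for t, logits in enumerate(tokens_logits):
        e, _ = top1_gate(logits)
        assignment.setdefault(e, []).append(t)
    return assignment
```

In a real system, each expert's assigned tokens are then dispatched to the device holding that expert (expert parallelism), which is where the multidimensional parallelism mentioned above comes in.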
Model scientists of Microsoft use DeepSpeed MoE to train <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/a-holistic-representation-toward-integrative-ai\/\">Z-code<\/a> MoE, a production-quality, multi-lingual, and multi-task language model with 10 billion parameters, achieving state-of-the-art results on machine translation and cross-lingual summarization tasks. Learn more in this <a href=\"https:\/\/aka.ms\/AAdgoco\">blog post<\/a>.\r\n\r\n<a href=\"https:\/\/www.deepspeed.ai\/tutorials\/mixture-of-experts\" target=\"_blank\" rel=\"noopener\">View the tutorial &gt;<\/a>\r\n\r\n<hr \/>\r\n\r\n[\/column][\/row]"}],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Sam Ade Jacobs","user_id":43503,"people_section":"Section name 0","alias":"samjacobs"},{"type":"user_nicename","display_name":"Doug Burger","user_id":31582,"people_section":"Section name 0","alias":"dburger"},{"type":"user_nicename","display_name":"Weizhu Chen","user_id":34863,"people_section":"Section name 0","alias":"wzchen"},{"type":"user_nicename","display_name":"Junyan Chen","user_id":37332,"people_section":"Section name 0","alias":"junyanch"},{"type":"user_nicename","display_name":"Nick Craswell","user_id":33088,"people_section":"Section name 0","alias":"nickcr"},{"type":"user_nicename","display_name":"Jianfeng Gao","user_id":32246,"people_section":"Section name 0","alias":"jfgao"},{"type":"user_nicename","display_name":"Mahdi Ghandi","user_id":37506,"people_section":"Section name 0","alias":"maghandi"},{"type":"user_nicename","display_name":"Xiaodong Liu","user_id":34877,"people_section":"Section name 0","alias":"xiaodl"},{"type":"user_nicename","display_name":"Jidong Long (\u9f99\u7ee7\u4e1c)","user_id":40027,"people_section":"Section name 0","alias":"jilong"},{"type":"user_nicename","display_name":"Jingwen Lu","user_id":40021,"people_section":"Section name 0","alias":"jinlu"},{"type":"user_nicename","display_name":"Rangan 
Majumder","user_id":38931,"people_section":"Section name 0","alias":"ranganm"},{"type":"user_nicename","display_name":"Todd Massengill","user_id":34236,"people_section":"Section name 0","alias":"toddma"},{"type":"user_nicename","display_name":"Madan Musuvathi","user_id":32766,"people_section":"Section name 0","alias":"madanm"},{"type":"user_nicename","display_name":"Xia Song","user_id":39315,"people_section":"Section name 0","alias":"xiaso"},{"type":"user_nicename","display_name":"Furu Wei","user_id":31830,"people_section":"Section name 0","alias":"fuwei"}],"msr_research_lab":[199565],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/649749","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":28,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/649749\/revisions"}],"predecessor-version":[{"id":1087521,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/649749\/revisions\/1087521"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/649848"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=649749"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=649749"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=649749"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=649749"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/
www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=649749"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}