AI at Scale


Models, Infrastructure and Hardware for Next-Generation AI Applications

AI innovation today is bound by the limitations of compute infrastructure, the effectiveness of machine learning models, and the ease of development.

Microsoft’s AI at Scale initiative is pioneering a new approach that will result in next-generation AI capabilities that are scaled across the company’s products and AI platforms.

This includes developing a new class of large, centralized AI models that can be scaled and specialized across product domains, as well as creating state-of-the-art hardware and infrastructure to power this new class of models.

AI at Scale builds on years of systems work by Microsoft researchers, particularly in parallel computation, that makes it possible to train machine learning models more quickly and at unprecedented scale. For instance, Project Parasail, established in 2014, pioneered a novel approach to parallelizing a large class of seemingly sequential applications, particularly stochastic gradient descent, in which dependencies are treated at runtime as symbolic values. PipeDream, part of Project Fiddle, introduced a novel approach to model training, called pipeline parallelism, that overcomes the higher communication costs of data parallelism and the hardware resource inefficiency of model parallelism. The result is up to 5.3 times faster training than traditional approaches. Read this blog post to learn more.
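The intuition behind pipeline parallelism can be seen in a back-of-the-envelope utilization calculation. The sketch below is our own illustration, not PipeDream's code (PipeDream's actual scheduling is more sophisticated, interleaving forward and backward passes):

```python
# Toy utilization model: a network split into S sequential stages, one per
# device, processing a batch split into M equal-cost micro-batches.

def busy_fraction_naive(stages: int) -> float:
    # Naive model parallelism: only one stage computes at any moment,
    # so each device is busy 1/S of the time.
    return 1.0 / stages

def busy_fraction_pipeline(stages: int, micro_batches: int) -> float:
    # Pipelining: after a fill period, all stages work concurrently;
    # each device does M useful steps out of S + M - 1 total time steps.
    return micro_batches / (stages + micro_batches - 1)

S, M = 4, 16
print(f"naive model parallelism: {busy_fraction_naive(S):.0%} busy")        # 25% busy
print(f"pipeline parallelism:    {busy_fraction_pipeline(S, M):.0%} busy")  # 84% busy
```

Unlike data parallelism, only the activations at stage boundaries cross devices, which is where the communication savings come from.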

These capabilities, together with others like DeepSpeed, are being integrated into the ONNX (Open Neural Network Exchange) Runtime, adding distributed training support to this open-source, high-performance, framework- and hardware-agnostic runtime for machine learning models. The result is an efficient way for developers to train and run inference on machine learning models in the framework and on the hardware of their choice.

DeepSpeed for large model training

DeepSpeed is a PyTorch-compatible library that vastly improves large model training by improving scale, speed, cost, and usability, unlocking the ability to train models with over 100 billion parameters. One piece of the DeepSpeed library, ZeRO-2, is a parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. DeepSpeed is open source; you can learn more in this blog post from February 2020 and this update on ZeRO from May 2020.
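The memory arithmetic behind ZeRO can be sketched in a few lines. This is our own back-of-the-envelope illustration, not the DeepSpeed API; it counts only Adam's fp32 momentum and variance states, which classic data parallelism replicates on every GPU and ZeRO partitions across them:

```python
def per_rank_optimizer_state_mb(n_params: int, world_size: int,
                                partitioned: bool) -> float:
    # Adam keeps two fp32 values (momentum and variance) per parameter.
    bytes_per_param = 2 * 4
    params_held = n_params / world_size if partitioned else n_params
    return params_held * bytes_per_param / 2**20

# A 1-billion-parameter model trained on 64 data-parallel GPUs:
replicated = per_rank_optimizer_state_mb(10**9, 64, partitioned=False)
sharded = per_rank_optimizer_state_mb(10**9, 64, partitioned=True)
print(f"replicated: {replicated:,.0f} MB per GPU")  # ~7,629 MB
print(f"ZeRO-style: {sharded:,.0f} MB per GPU")     # ~119 MB
```

Partitioning gradients and parameters as well, as ZeRO's later stages do, extends the same idea to the remaining memory consumers.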

AI at Scale is enabling breakthroughs in areas such as natural language processing (NLP) and multi-modality (combining language with other types of data, such as images, video, and speech).

In September 2020, DeepSpeed was updated with new system technologies: trillion-parameter model training with 3D parallelism, ZeRO-Offload to enable training of 10x larger models, Sparse Attention to power 10x longer sequences and 6x faster execution, and 1-bit Adam with up to 5x communication volume reduction.
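The communication saving in 1-bit Adam comes from sign-based compression with error feedback. Here is a minimal toy of that compression step, our own sketch rather than DeepSpeed's implementation (the real algorithm also freezes Adam's variance term after a warmup phase before compressing):

```python
import numpy as np

def one_bit_compress(grad: np.ndarray, error_buffer: np.ndarray) -> np.ndarray:
    """Compress a tensor to 1 bit per entry plus one shared fp32 scale."""
    corrected = grad + error_buffer           # fold in last step's quantization error
    scale = np.mean(np.abs(corrected))        # single scale for the whole tensor
    compressed = scale * np.sign(corrected)   # what actually gets communicated
    error_buffer[:] = corrected - compressed  # carry the residual to the next step
    return compressed

rng = np.random.default_rng(0)
g = rng.normal(size=1_000)
err = np.zeros_like(g)
c = one_bit_compress(g, err)
# Each entry of c is +/-scale, roughly 1/32 the bandwidth of an fp32 gradient.
```

The error buffer is what keeps the scheme stable over many steps: whatever the quantizer loses on one step is re-injected on the next, so compression error does not accumulate.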

Advances in natural language processing

Turing Natural Language Generation (T-NLG) is a 17-billion-parameter language model that outperforms the state of the art on many downstream NLP tasks. In particular, it can enhance the Microsoft Office experience through writing assistance and answering reader questions, and it paves the way for more fluent digital assistants. You can read more about T-NLG in this blog post. In September 2020, Bing announced new updates that make use of T-NLG to improve autosuggest results; read this blog post to learn more.

On the multi-modality language-image front, Oscar (Object-Semantics Aligned Pre-training) has significantly outperformed the state of the art on downstream language-image tasks (e.g., visual search).

Recently, pre-trained models such as Unicoder, M-BERT, and XLM have been developed to learn multilingual representations for cross-lingual and multilingual tasks. By performing masked language modeling, translation language modeling, and other bilingual pre-training tasks on multilingual and bilingual corpora with shared vocabulary and weights for multiple languages, these models obtain surprisingly good cross-lingual capability. However, the community still lacks benchmark datasets to evaluate such capability. To help researchers further advance language-agnostic models and make AI systems more inclusive, the XGLUE dataset lets researchers test a language model's zero-shot cross-lingual transfer capability – its ability to transfer what it learned in English to the same task in other languages. Download the dataset here, and read this blog post to learn more.

We are incorporating these breakthroughs into the company’s products, including Bing, Office, Dynamics, and Xbox. Read this blog post to learn more.

Project Brainwave: new hardware for deep learning

In the realm of hardware, Project Brainwave is a deep learning platform for real-time AI inference in the cloud and on the edge. A soft Neural Processing Unit (NPU), based on a high-performance field-programmable gate array (FPGA), accelerates deep neural network (DNN) inferencing, with applications in computer vision and natural language processing. This approach is transforming computing by augmenting CPUs with an interconnected and configurable compute layer composed of programmable silicon.

With a high-performance, precision-adaptable FPGA soft processor, Microsoft datacenters can serve pre-trained DNN models with high efficiency at low batch sizes. The use of an FPGA means the platform remains flexible for continuous innovation and improvement, making the infrastructure future-proof.
By exploiting FPGAs as a datacenter-scale compute fabric, a single DNN model can be deployed as a scalable hardware microservice that leverages multiple FPGAs to create web-scale services, processing massive amounts of data in real time.

Learn more about Project Brainwave >

Spell correction at scale

Customers around the world use Microsoft products in over 100 languages, yet most do not come with high-quality spell correction. This prevents customers from maximizing their ability to search for information on the web and enterprise—and even to author content. With AI at Scale, we used deep learning along with language families to solve this problem for customers by building what we believe is the most comprehensive and accurate spelling correction system ever in terms of language coverage and accuracy. Learn more in this blog post.

Learn more

Visit the Microsoft Innovation site to learn more about this initiative, including a deep dive into the technology.

Nominate your organization for a private preview of Semantic Search by Project Turing >

Visit the Microsoft Project Turing site >




February 2020

Microsoft Project Turing team announces Turing-NLG language model, clocking in at 17 billion parameters

In February, the Turing Natural Language Generation (Turing-NLG) model made waves as the largest language model at the time. To train the Transformer-based generative model, researchers used a novel model parallelism technique, courtesy of the Zero Redundancy Optimizer (ZeRO), and tensor slicing to shard the model across four NVIDIA V100 GPUs on the NVIDIA Megatron-LM framework. Among its capabilities, the team highlighted direct question answering, zero-shot question answering, and abstractive summarization with less supervision.

Listen to the podcast >

Microsoft DeepSpeed team releases open-source library, including ZeRO, a novel zero redundancy optimizer

In conjunction with Turing-NLG, Microsoft researchers released the open-source DeepSpeed library, which improved large model training in four key areas: scale, speed, cost, and usability. Initially, the library included ZeRO-1, which decreased the resources needed for model and data parallelism while greatly increasing the number of trainable parameters, up to 100 billion.

DeepSpeed, along with other distributed training tools, is being incorporated into the ONNX (Open Neural Network Exchange) Runtime, an open-source, high-performance engine for machine learning models.

Read the publication >

Explore the DeepSpeed project >

Download ONNX Runtime >

UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

This paper introduced a new method for pretraining unified language models using a pseudo-masked language model (PMLM). The method can be used for both autoencoding and partially autoregressive language modeling tasks, and UniLMv2, a model trained with it, achieved state-of-the-art results on a number of natural language understanding and generation tasks across widely used benchmarks.

Read the publication >


April 2020

Microsoft researchers introduce Oscar for vision and language pretraining

Oscar (Object-Semantics Aligned Pretraining) arose from the observation that objects detected in images can be used as anchor points, making it easier to learn semantic alignments between images and text in a shared space. The vision and language pretraining (VLP) framework set state-of-the-art performance on six well-established vision-and-language tasks.

Read the publication >

SMART: Using principled regularized optimization for pre-trained natural language model fine-tuning

In transfer learning, fine-tuning pretrained natural language processing (NLP) models can cause the model to overfit the training data of downstream tasks and fail to generalize to unseen data. The SMART framework uses smoothness-inducing regularization to manage the complexity of the model, and Bregman proximal point optimization to prevent aggressive updating.
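The smoothness-inducing idea can be made concrete with a small numpy toy, our own illustration rather than the SMART implementation: perturb the input slightly and penalize how much the model's output distribution moves. (SMART actually searches for the worst-case perturbation in a norm ball with projected gradient ascent; here a single random perturbation stands in for that inner maximization, and the "model" is a hypothetical stand-in.)

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def symmetric_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    # Symmetric KL written so each term is non-negative by construction.
    return float(np.sum((p - q) * np.log((p + eps) / (q + eps))))

def smoothness_penalty(model, x: np.ndarray, epsilon: float = 1e-3,
                       rng=None) -> float:
    # One random perturbation of norm epsilon around the input.
    rng = rng or np.random.default_rng(0)
    delta = rng.normal(size=x.shape)
    delta = epsilon * delta / np.linalg.norm(delta)
    return symmetric_kl(model(x), model(x + delta))

# Hypothetical stand-in "model": softmax over a fixed linear layer.
W = np.array([[1.0, -1.0], [0.5, 2.0]])
model = lambda x: softmax(W @ x)
penalty = smoothness_penalty(model, np.array([0.3, -0.7]))
# A locally smooth model keeps this penalty small; during fine-tuning
# it is added to the task loss as a regularizer.
```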

Read the publication >

Adversarial training for large neural language models

This work shared a comprehensive study of adversarial training in all stages of training for large neural language models: pretraining from scratch, continual pretraining on a well-trained model, and fine-tuning for specific tasks. The researchers also created a general algorithm to maximize adversarial loss, called ALUM, which obtained substantial gains over BERT on many NLP tasks.

Read the publication >

Effects of the adaptive learning rate on stochastic gradient-based optimization

Microsoft researchers and collaborators looked more closely at how warmup should be conducted in stochastic gradient-based optimization. More specifically, they zeroed in on the variance issue in the adaptive learning rate and observed its root cause: the limited number of training samples used in the early stages of training causes undesirably large variance. They also presented a new variant of Adam, called RAdam, which corrects this variance problem and compares favorably with heuristic warmup.
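RAdam's fix is a closed-form rectification term. The sketch below follows the formulas from the paper rather than any library implementation, with a hypothetical function name:

```python
import math

def rectification(step: int, beta2: float = 0.999):
    """Return RAdam's variance-rectification factor r_t, or None while the
    approximated simple-moving-average length rho_t is too small to trust."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * step * beta2**step / (1.0 - beta2**step)
    if rho_t <= 4.0:
        # Variance of the adaptive term is intractable this early:
        # fall back to SGD with momentum for this step.
        return None
    return math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                     / ((rho_inf - 4) * (rho_inf - 2) * rho_t))

# Early steps skip the adaptive term entirely; later, r_t approaches 1
# and RAdam behaves like standard Adam with bias correction.
```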

Read the publication >


May 2020

Turing Natural Language Representation updates for Bing announced

Microsoft Bing shared how Turing language model capabilities are powering features in the search engine. Features included synthesizing a simple “yes” or “no” response to applicable search queries, using a zero-shot approach in which only an English-language model is fine-tuned and then applied to 100 different languages that had pretrained models, and improving query intent understanding.

DeepSpeed team announces ZeRO-2 and optimizations that set the fastest BERT training record at the time

Figure 2: ZeRO-2 scales to 170 billion parameters, has up to 10x higher throughput, obtains superlinear speedup, and improves usability by avoiding the need for code refactoring for models up to 13 billion parameters.

Improvements to the DeepSpeed library allowed Microsoft researchers to set the fastest BERT training record at the time: 44 minutes on 1,024 NVIDIA GPUs. To do this, they used kernel optimizations that boost the single-GPU performance of models like BERT by more than 30%, optimizations that also allow for better scaling of large models. In ZeRO-2, memory footprints were reduced across gradients, activation memory, and fragmented memory to improve the scale and speed of deep learning training with DeepSpeed by an order of magnitude.


June 2020

DeBERTa: Decoding enhanced BERT with disentangled attention

Microsoft researchers created DeBERTa (Decoding enhanced BERT with disentangled attention), a Transformer-based neural language model that makes two changes to BERT. First, it uses disentangled self-attention, with each word represented by two vectors that encode its content and position; the attention weights are then computed from both contents and relative positions. Second, DeBERTa enhances the output layer of BERT for pretraining by replacing the output softmax layer with an enhanced masked decoder (EMD) to predict the masked tokens.
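A minimal numpy sketch of the disentangled score computation may help (illustrative shapes and random weights, our own toy rather than DeBERTa's code): each query-key score sums a content-to-content, a content-to-position, and a position-to-content term, scaled by sqrt(3d) as in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 8                       # sequence length, head dimension
H = rng.normal(size=(L, d))       # content vectors, one per token
P = rng.normal(size=(2 * L, d))   # relative-position embeddings for distances in [-L, L)

Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))    # content projections
Wqr, Wkr = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # position projections

Qc, Kc = H @ Wq, H @ Wk
Qr, Kr = P @ Wqr, P @ Wkr

def rel(i: int, j: int) -> int:
    # Clip the relative distance and shift it into P's index range.
    return int(np.clip(i - j, -L, L - 1)) + L

scores = np.zeros((L, L))
for i in range(L):
    for j in range(L):
        c2c = Qc[i] @ Kc[j]          # content-to-content
        c2p = Qc[i] @ Kr[rel(i, j)]  # content-to-position
        p2c = Kc[j] @ Qr[rel(j, i)]  # position-to-content
        scores[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)

attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)  # each row is a probability distribution
```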

Read the publication >

Microsoft researchers release XGLUE, a benchmark dataset for cross-lingual transfer learning in language models

To test language models' zero-shot cross-lingual transfer capability, Microsoft researchers announced the release of the XGLUE benchmark dataset. With training data available only in English, the dataset comprises 11 downstream tasks covering 19 languages, including Italian, Portuguese, Swahili, and Urdu. The tasks cover cross-lingual natural language understanding and generation, as well as tests unique to creating and evaluating search engine and news site scenarios.


September 2020

Bing announces updates that make use of Turing Natural Language Generation and expanded use of Turing Natural Language Representation

Bing announced new updates for search utilizing Microsoft Turing capabilities. These included improvements to Autosuggest and the “People Also Ask” (PAA) feature, an expansion of cross-lingual intelligent answers to over 100 languages and 200 regions, and semantic highlighting for captions to better surface answers in search.

Updates and optimizations for the DeepSpeed library announced

Microsoft researchers worked hard all year to make significant updates to the DeepSpeed deep learning training optimization library. In this release, researchers introduced 3D parallelism, ZeRO-Offload, DeepSpeed Sparse Attention, and 1-bit Adam. These updates, among other advances, allowed more people to use DeepSpeed with fewer resources and expanded the scope of its efficiency across compute, memory, and communication, enabling training of models up to 1 trillion parameters.

Watch the webinar >


October 2020

Turing Universal Language Representation model takes top spot on XTREME leaderboard

TULRv2, the Turing Universal Language Representation model for cross-lingual generalization, tops the Google XTREME leaderboard. The model uses another recent Microsoft innovation, InfoXLM, to create a universal model that represents 94 languages in the same vector space, and it is being used to power features in Microsoft Word, Outlook, and Teams. The XTREME leaderboard covers 40 languages spanning 12 language families, and it challenges models to reason about syntax and semantics at varying levels.


November 2020

SC-GPT model: Using few-shot natural language generation for task-oriented dialog

In this work, Microsoft researchers set out to improve generalization with limited labeled data for natural language generation (NLG) models. To do so, they developed the first NLG benchmark to simulate few-shot learning in task-oriented dialog systems. They also created the SC-GPT model, a multi-layer Transformer neural language model that generates semantically controlled responses conditioned on a given semantic form and requires far fewer domain labels to generalize to new domains.

Read the publication >


December 2020

Project Brainwave + Microsoft Floating Point

Microsoft researchers and engineers announced Microsoft Floating Point, a data type that brings together the efficiency of integer data types with accuracy comparable to floating point. It is used in the Project Brainwave architecture to power real-time, production-scale deep neural network inference in the cloud, and it enables features in many Microsoft products, including Office 365 and Bing. The Project Brainwave architecture, to be turbocharged by silicon-hardened Microsoft Floating Point, is expected to play a pivotal role in the future of hardware-algorithm codesign.
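Microsoft Floating Point is a block floating-point format. The toy below sketches the general block floating-point idea, as our own illustration rather than the published MSFP formats: a block of values shares one power-of-two exponent, and each value keeps only a short signed mantissa, so arithmetic inside a block reduces to cheap integer operations.

```python
import numpy as np

def bfp_quantize(x: np.ndarray, mantissa_bits: int = 4) -> np.ndarray:
    """Quantize a block to a shared exponent plus short signed mantissas,
    returning the dequantized values for inspection."""
    # One shared power-of-two exponent per block, set by the largest magnitude.
    shared_exp = np.ceil(np.log2(np.max(np.abs(x)) + 1e-38))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))  # weight of one mantissa step
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(x / scale), lo, hi)   # short signed integers
    return mantissas * scale

x = np.array([0.11, -0.52, 0.25, 0.98])
xq = bfp_quantize(x)  # every entry becomes a multiple of the block's scale
```

Sharing the exponent across a block is what lets the multiply-accumulate datapath stay integer-only, which is why the format maps so well to FPGA and ASIC hardware.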

Read the publication >

Read the blog >


January 2021


DeBERTa tops the SuperGLUE and GLUE benchmark leaderboards

Updates to the Transformer-based DeBERTa neural language model boost its performance, topping the SuperGLUE and GLUE benchmark leaderboards. The updates include training a larger version of the model that contains 48 Transformer layers and 1.5 billion parameters. The updated single model surpasses human performance on the SuperGLUE benchmark for the first time based on macro-average score, while the ensemble model outperforms the single model to top both leaderboards. DeBERTa is being incorporated into the next iteration of the Microsoft Turing natural language representation model, Turing NLRv4.

Read the publication >

Read the blog >

DeBERTa on GitHub >


Microsoft researchers introduce VinVL, an improved image-encoding model for vision-language tasks

Researchers from Microsoft have developed a new object-attribute detection model for image encoding, dubbed VinVL (Visual features in Vision-Language), and performed a comprehensive empirical study showing that visual features matter significantly in VL models. Learn more in this blog post.

Read the publication >


February 2021

Spell correction at scale

Using deep learning along with language families, Microsoft built what it believes is the most comprehensive and accurate spelling correction system ever in terms of language coverage and accuracy, serving the more than 100 languages in which customers use Microsoft products. Learn more in this blog post.


March 2021

The best of AI at Scale: Semantic search capabilities available to Azure customers in preview

Microsoft Bing partnered with Azure Cognitive Search to make state-of-the-art search AI available to Azure customers through semantic search. Semantic search enables modern search experiences such as semantic ranking, extractive summarization, and machine reading comprehension. These features were built by applying Microsoft Research technology and advancements, including UniLM, Multi-Task Deep Neural Networks, MiniLM, and graph attention networks for machine reading comprehension, to search scenarios. Deep neural network transfer learning allows the models to run well in Azure. An online A/B experiment in which semantic search was enabled for Microsoft Docs produced a significant 4.5 percent clickthrough rate increase on challenging queries (three or more words), the largest relevance improvement the Microsoft Docs team has seen. Learn more in this blog post.


April 2021


DeepSpeed team releases ZeRO-Infinity for extreme-scale model training

The DeepSpeed team releases ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow unprecedented model scale on limited resources without requiring model code refactoring, while achieving excellent training throughput and scalability unencumbered by limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current-generation GPU clusters, and it can be used to fine-tune trillion-parameter models on a single NVIDIA DGX-2 node, making large models more accessible. In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak) while also demonstrating superlinear scalability.

Read the publication >

Read the blog >

DeepSpeed on GitHub >