News & features
Research Focus: Week of November 7, 2022
Welcome to Research Focus, a new series …
DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization
| DeepSpeed Team and Andrey Proskurin
Large-scale models are revolutionizing d…
DeepSpeed: Advancing MoE inference and training to power next-generation AI scale
| DeepSpeed Team and Andrey Proskurin
In the last three years, the largest tra…
Efficiently and effectively scaling up language model pretraining for best language representation model on GLUE and SuperGLUE
| Jianfeng Gao and Saurabh Tiwary
As part of Microsoft AI at Scale …
Turing Bletchley: A Universal Image Language Representation model by Microsoft
| Saurabh Tiwary
Today, the Microsoft Turing team …
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model
| Ali Alvi and Paresh Kharya
We are excited to introduce the DeepSpee…
Microsoft Turing Universal Language Representation model, T-ULRv5, tops XTREME leaderboard and trains 100x faster
| Saurabh Tiwary and Lidong Zhou
Today, we are excited to announce that w…
DeepSpeed powers 8x larger MoE model training with high performance
| DeepSpeed Team and Z-code Team
Today, we are proud to announce DeepSpee…
Make Every feature Binary: A 135B parameter sparse neural network for massively improved search relevance
| Junyan Chen, Frédéric Dubut, Jason (Zengzhong) Li, and Rangan Majumder
Recently, Transformer-based deep learnin…