{"id":635250,"date":"2020-02-13T13:14:29","date_gmt":"2020-02-10T17:04:49","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=635250"},"modified":"2020-02-13T13:14:31","modified_gmt":"2020-02-13T21:14:31","slug":"turing-nlg-a-17-billion-parameter-language-model-by-microsoft","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/turing-nlg-a-17-billion-parameter-language-model-by-microsoft\/","title":{"rendered":"Turing-NLG: A 17-billion-parameter language model by Microsoft"},"content":{"rendered":"<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-635634 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788.png\" alt=\"chart \" width=\"1400\" height=\"788\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788.png 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-960x540.png 960w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-1280x720.png 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/p>\n<p style=\"text-align: center;\">This figure was adapted from a similar image published in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1910.01108.pdf\">DistilBERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<blockquote>\n<p style=\"text-align: left;\"><strong><em>Turing Natural Language Generation (T-NLG) is a 17 billion parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks. We present a demo of the model, including its freeform generation, question answering, and summarization capabilities, to academics for feedback and research purposes. <|endoftext|><\/em><\/strong><\/p>\n<p style=\"text-align: left;\">\u00a0&#8211; This summary was generated by the Turing-NLG language model itself.<\/p>\n<\/blockquote>\n<p>Massive deep learning language models (LM), such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1810.04805\">BERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openai.com\/blog\/better-language-models\/\">GPT-2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, with billions of parameters learned from essentially all the text published on the internet, have improved the state of the art on nearly every downstream natural language processing (NLP) task, including question answering, conversational agents, and document understanding among others.<\/p>\n<p>Better natural language 
generation can be transformational for a variety of applications, such as assisting authors with composing their content, saving users time by summarizing a long piece of text, or improving customer experience with digital assistants. Following the trend that larger natural language models lead to better results, Microsoft <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/msturing.org\">Project Turing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is introducing Turing Natural Language Generation (T-NLG), the largest model ever published at 17 billion parameters, which outperforms the state of the art on a variety of language modeling benchmarks and also excels when applied to numerous practical tasks, including summarization and question answering. This work would not be possible without breakthroughs produced by the <b><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/DeepSpeed\">DeepSpeed library<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/b>\u00a0(compatible with <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/pytorch.org\/\">PyTorch<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>) and the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-memory-optimization-towards-training-a-trillion-parameter-models\/\">ZeRO optimizer<\/a>, which are explored further in this accompanying <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters\">blog post<\/a>.<\/p>\n<p>We are releasing a private demo of T-NLG, including its freeform generation, question answering, and summarization 
capabilities, to a small set of users within the academic community for initial testing and feedback.<\/p>\n<h3>T-NLG: Benefits of a large generative language model<\/h3>\n<p>T-NLG is a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1706.03762.pdf\">Transformer-based<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> generative language model, which means it can generate words to complete open-ended textual tasks. In addition to completing an unfinished sentence, it can generate direct answers to questions and summaries of input documents.<\/p>\n<p>Generative models like T-NLG are important for NLP tasks since our goal is to respond as directly, accurately, and fluently as humans can in any situation. Previously, systems for question answering and summarization relied on extracting existing content from documents that could serve as a stand-in answer or summary, but such extracts often appear unnatural or incoherent. With T-NLG we can naturally summarize or answer questions about a personal document or email thread.<\/p>\n<p>We have observed that the bigger the model and the more diverse and comprehensive the pretraining data, the better it performs at generalizing to multiple downstream tasks even with fewer training examples. Therefore, we believe it is more efficient to train a large centralized multi-task model and share its capabilities across numerous tasks rather than train a new model for every task individually.<\/p>\n<h3>Pretraining T-NLG: Hardware and software breakthroughs<\/h3>\n<p>Any model with more than 1.3 billion parameters cannot fit into a single GPU (even one with 32GB of memory), so the model itself must be parallelized, or broken into pieces, across multiple GPUs. 
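<\/p>
<p>As a back-of-the-envelope check (a sketch, using the roughly 16 bytes of training state per parameter that the ZeRO paper attributes to mixed-precision Adam: fp16 weights and gradients plus fp32 master weights and two fp32 optimizer moments), the model states alone approach a 32GB GPU's capacity at 1.3 billion parameters, before counting activations:<\/p>

```python
# fp16 weights + fp16 gradients + fp32 master weights + 2 fp32 Adam moments
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16 bytes per parameter

def model_state_gb(num_params: float) -> float:
    """Training-time model-state memory in GB, excluding activations."""
    return num_params * BYTES_PER_PARAM / 1024**3

print(f"1.3B params: {model_state_gb(1.3e9):.1f} GB of model states")
print(f" 17B params: {model_state_gb(17e9):.1f} GB of model states")
```

At 1.3 billion parameters this is already about 19 GB before activations; at 17 billion it is roughly 250 GB, far beyond any single GPU.
<p>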
We took advantage of several hardware and software breakthroughs to train T-NLG:<\/p>\n<p style=\"padding-left: 40px;\">1. We leverage an NVIDIA DGX-2 hardware setup, with InfiniBand connections so that communication between GPUs is faster than previously achieved.<\/p>\n<p style=\"padding-left: 40px;\">2. We apply tensor slicing to shard the model across four NVIDIA V100 GPUs on the NVIDIA Megatron-LM framework.<\/p>\n<p style=\"padding-left: 40px;\">3. DeepSpeed with\u00a0<u><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1910.02054\">ZeRO<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/u> allowed us to reduce the model-parallelism degree (from 16 to 4), increase batch size per node fourfold, and reduce training time by a factor of three. DeepSpeed makes training very large models more efficient with fewer GPUs: it trains at a batch size of 512 with only 256 NVIDIA GPUs, compared to the 1024 NVIDIA GPUs needed when using Megatron-LM alone. DeepSpeed is compatible with\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/pytorch.org\/\">PyTorch<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<p>The resulting T-NLG model has 78 Transformer layers with a hidden size of 4256 and 28 attention heads. To make results comparable to Megatron-LM, we pretrained the model with the same hyperparameters and learning schedule as Megatron-LM, using autoregressive generation loss for 300,000 steps at a batch size of 512 on sequences of 1024 tokens. 
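<\/p>
<p>As a sanity check on the 17-billion figure, the layer count and hidden size above can be plugged into the standard ~12&#183;L&#183;h<sup>2<\/sup> weight-count approximation for a Transformer decoder (roughly 4h<sup>2<\/sup> for attention and 8h<sup>2<\/sup> for the feed-forward block per layer), plus a token-embedding matrix. The 50,000-token vocabulary below is an illustrative assumption, not a figure from this post:<\/p>

```python
def transformer_param_count(layers: int, hidden: int, vocab: int) -> int:
    """Rough Transformer decoder size: ~12*h^2 weights per layer plus embeddings."""
    return layers * 12 * hidden * hidden + vocab * hidden

# 78 layers and hidden size 4256 are from the post; vocab is an assumption.
total = transformer_param_count(layers=78, hidden=4256, vocab=50_000)
print(f"~{total / 1e9:.1f}B parameters")  # ~17.2B for the configuration above
```

<p>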
The learning schedule followed 3,200 steps of linear warmup up to a maximum learning rate of 1.5&#215;10<sup>-4 <\/sup>and cosine decay over 500,000 steps, with <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/devblogs.nvidia.com\/apex-pytorch-easy-mixed-precision-training\/\">FP16<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. We trained the model on the same type of data that Megatron-LM models were trained on.<\/p>\n<p>We also compared the performance of the pretrained T-NLG model on standard language tasks such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/blog.einstein.ai\/the-wikitext-long-term-dependency-language-modeling-dataset\/\">WikiText-103<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> perplexity (lower is better) and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/cybertronai\/bflm\/blob\/master\/lambada_test.jsonl\">LAMBADA<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> next word prediction accuracy (higher is better). The table below shows that we achieve the new state of the art on both LAMBADA and WikiText-103. 
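<\/p>
<p>Perplexity, the WikiText-103 metric reported above, is simply the exponential of the mean per-token cross-entropy, so a small drop in validation loss translates into a multiplicative drop in perplexity. A minimal sketch of the conversion (the loss values are illustrative, not the benchmark numbers):<\/p>

```python
import math

def perplexity(mean_nll_nats: float) -> float:
    """Perplexity from mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll_nats)

print(round(perplexity(2.40), 2))  # 11.02
print(round(perplexity(2.30), 2))  # 9.97 -- a 0.1 drop in loss cuts perplexity ~10%
```

<p>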
The Megatron-LM row shows the publicly released results from the NVIDIA Megatron model.<\/p>\n<div id=\"attachment_635883\" style=\"width: 1034px\" class=\"wp-caption alignleft\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-635883\" class=\"wp-image-635883 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-_Updated-2.9.20-1024x276.jpg\" alt=\"table\" width=\"1024\" height=\"276\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-_Updated-2.9.20-1024x276.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-_Updated-2.9.20-300x81.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-_Updated-2.9.20-768x207.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-_Updated-2.9.20-1536x413.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-_Updated-2.9.20.jpg 1824w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-635883\" class=\"wp-caption-text\">*OpenAI used additional processing (stopword filtering) to achieve higher numbers than the model achieved alone. 
Neither Megatron nor T-NLG uses this stopword filtering technique.<\/p><\/div>\n<p>Figure 1 below shows how T-NLG performs when compared with Megatron-LM on validation perplexity.<\/p>\n<div id=\"attachment_636072\" style=\"width: 810px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-636072\" class=\"wp-image-636072 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/MicrosoftTeams-image-2.jpg\" alt=\"Chart\" width=\"800\" height=\"546\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/MicrosoftTeams-image-2.jpg 800w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/MicrosoftTeams-image-2-300x205.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/MicrosoftTeams-image-2-768x524.jpg 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><p id=\"caption-attachment-636072\" class=\"wp-caption-text\">Figure 1: Comparison of the validation perplexity of the 8-billion-parameter Megatron-LM model (orange line) and the 17-billion-parameter T-NLG model during training (blue and green lines). The dashed line represents the lowest validation loss achieved by the current public state-of-the-art model. The transition from blue to green in the figure indicates where T-NLG overtakes the public state of the art.<\/p><\/div>\n<h3 style=\"text-align: left;\">Direct question answering and zero-shot question capabilities<\/h3>\n<p>Many web search users are accustomed to seeing a direct answer card displayed at the top of the results page when they ask a question. Most of those cards show an answer sentence within the context of the paragraph it originated from. Our goal is to more plainly satisfy users\u2019 information needs by responding directly to their question. 
For instance, most search engines would highlight the name \u201cTristan Prettyman\u201d when showing the full passage (see example below).<\/p>\n<table class=\"aligncenter\" style=\"width: 90%; border-collapse: separate; border-spacing: inherit; border: 1px solid;\" cellspacing=\"inherit\" cellpadding=\"5\">\n<tbody>\n<tr>\n<td style=\"width: 25%; padding: 5px; border: 1px solid;\"><strong>Question<\/strong><\/td>\n<td style=\"width: 75%; padding: 5px; border: 1px solid;\">Who was Jason Mraz engaged to?<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 25%; padding: 5px; border: 1px solid; height: 49px;\"><strong>Passage<\/strong><\/td>\n<td style=\"width: 75%; padding: 5px; border: 1px solid; height: 49px;\">Mraz was engaged to singer\/songwriter and long-time close friend <strong>Tristan Prettyman<\/strong> on Christmas Eve 2010; they broke off the engagement six months later.<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 25%; padding: 5px; border: 1px solid;\"><strong>&#8220;Direct&#8221; Answer<\/strong><\/td>\n<td style=\"width: 75%; padding: 5px; border: 1px solid;\">Jason Mraz was engaged to Tristan Prettyman.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Instead, T-NLG will directly answer the question with a complete sentence. This capability is even more important outside of web search\u2014for example, it can power AI assistants to intelligently respond when a user asks a question about their personal data such as emails or Word documents.<\/p>\n<p>The model is also capable of \u201czero-shot\u201d question answering, meaning answering without a context passage. For the examples below, there was no passage given to the model, just the question. 
In these cases, the model relies on knowledge gained during pretraining to generate an answer.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-635292 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-3.jpg\" alt=\"table\" width=\"1254\" height=\"109\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-3.jpg 1254w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-3-300x26.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-3-1024x89.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-3-768x67.jpg 768w\" sizes=\"auto, (max-width: 1254px) 100vw, 1254px\" \/><\/p>\n<p>Since <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/ROUGE_(metric)\">ROUGE scores<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> with the ground truth answer don\u2019t capture other aspects of quality, like factual correctness and grammatical correctness, we asked human annotators to evaluate those qualities for our previous baseline system\u2014an LSTM model similar to <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1603.06393\">CopyNet<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u2014and our current T-NLG model. 
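<\/p>
<p>For reference, ROUGE-N counts n-gram overlap between a generated answer and the ground truth. The following is a minimal ROUGE-1 F1 sketch; real evaluations use the official ROUGE toolkit with stemming and multiple references, and the example strings here are illustrative:<\/p>

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("jason mraz was engaged to tristan prettyman",
                      "mraz was engaged to tristan prettyman"), 2))  # 0.92
```

<p>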
There is still work to be done to enable automatic evaluation of factual correctness.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-635298 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-4.jpg\" alt=\"table\" width=\"859\" height=\"148\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-4.jpg 859w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-4-300x52.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-4-768x132.jpg 768w\" sizes=\"auto, (max-width: 859px) 100vw, 859px\" \/><\/p>\n<p>We also note that a larger pretrained model requires fewer instances of downstream tasks to learn them well. We only had, at most, 100,000 examples of \u201cdirect\u201d answer question-passage-answer triples, and even after only a few thousand instances of training, we had a model that outperformed the LSTM baseline that was trained on multiple epochs of the same data. This observation has real business impact, since it is expensive to collect annotated supervised data.<\/p>\n<h3>Abstractive summarization with less supervision<\/h3>\n<p>There are two types of summarization in the NLP literature: <em>extractive<\/em>\u2014taking a small number of sentences from the document as a surrogate of a summary\u2014and <em>abstractive<\/em>\u2014generating a summary with an NLG model as a human would. Rather than copying existing content, our goal for T-NLG is to write human-like abstractive summaries for a wide range of text documents: emails, blog posts, Word documents, and even Excel sheets and PowerPoint presentations. One of the main challenges is a lack of supervised training data for all these scenarios: humans don\u2019t always explicitly summarize each of these document types. 
The power of T-NLG is that it is already so adept at understanding text that it doesn\u2019t need much supervision to outperform all the techniques we\u2019ve employed previously.<\/p>\n<p>To make T-NLG as versatile as possible for summarizing different types of text, we finetuned the T-NLG model in a multi-task fashion on nearly all publicly available summarization datasets, amounting to approximately four million training instances. We report <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/ROUGE_(metric)\">ROUGE scores<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (a proxy for how well the generated summary exactly matches the unigrams and bigrams in a reference summary) to compare with another recent Transformer-based language model known as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1912.08777\">PEGASUS<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and previous state of the art models.<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-636075 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-Update.jpg\" alt=\"table\" width=\"1409\" height=\"459\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-Update.jpg 1409w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-Update-300x98.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-Update-1024x334.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/Turing-Table-Update-768x250.jpg 768w\" sizes=\"auto, (max-width: 1409px) 100vw, 1409px\" \/><br \/>\nNote that T-NLG is trained in a 
multi-task fashion, on all of these datasets simultaneously. Since ROUGE evaluation is known to be flawed for summarization tasks (a good abstractive summary can be written in many ways), we include a few output summaries of publicly available articles for comparison below.<\/p>\n<table class=\"aligncenter\" style=\"width: 90%; border-collapse: separate; border-spacing: inherit; border: 1px solid;\" cellspacing=\"inherit\" cellpadding=\"5\">\n<tbody>\n<tr>\n<td style=\"width: 25%; padding: 5px; border: 1px solid;\"><strong>Link to original text<\/strong><\/td>\n<td style=\"width: 75%; padding: 5px; border: 1px solid;\"><strong>T-NLG Summary<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 25%; padding: 5px; border: 1px solid;\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/blogs.microsoft.com\/blog\/2020\/01\/16\/microsoft-will-be-carbon-negative-by-2030\/\">\u201cMicrosoft will be carbon negative by 2030\u201d by Brad Smith, Official Microsoft Blog<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td>\n<td style=\"width: 75%; padding: 5px; border: 1px solid;\">Microsoft is committed to being carbon negative by 2030. We are launching an aggressive program to cut our carbon emissions by more than half by 2030, both for our direct emissions and for our entire supply and value chain. We are also launching an initiative to use Microsoft technology to help our suppliers and customers reduce their own carbon footprints and a new $1 billion climate innovation fund to accelerate the development of carbon reduction, capture, and removal technologies that will help us and the world become carbon negative. In addition to our aggressive carbon goals, we are launching a new Climate Innovation Fund to accelerate carbon reduction and removal opportunities. 
We are also launching a program to use our technology to improve the efficiency of our supply chain and reduce our own carbon footprint as well&#8230;<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 25%; padding: 5px; border: 1px solid;\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.newyorker.com\/magazine\/2019\/10\/14\/can-a-machine-learn-to-write-for-the-new-yorker\">&#8220;The Next Word: Where will predictive text take us?&#8221; by John Seabrook, <span style=\"font-style: italic;\">The New Yorker<\/span><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td>\n<td style=\"width: 75%; padding: 5px; border: 1px solid;\">&#8220;Smart&#8221; compose is a feature that suggests what going to type next, based on your past e-mails and your current thoughts. But what happens when you let the A.I. decide what you want to say next?<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>T-NLG future applications<\/h3>\n<p>T-NLG has advanced the state of the art in natural language generation, providing new opportunities for Microsoft and our customers. Beyond saving our users time by summarizing documents and emails, T-NLG can enhance experiences with the Microsoft Office suite by offering writing assistance to authors and answering questions that readers may ask about a document. 
Furthermore, it paves the way for more fluent <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/dev.botframework.com\/\">chatbots<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/cortana\/\">digital assistants<\/a>, as natural language generation can help businesses with <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/dynamics.microsoft.com\/en-us\/crm\/what-is-crm\/\">customer relationship management and sales<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> by conversing with customers. We are excited by the new possibilities as we continue to advance the quality of language models.<\/p>\n<p><strong>About Project Turing: <\/strong>T-NLG is part of a larger initiative called <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/msturing.org\">Project Turing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, an applied research group that works to evolve Microsoft products with the adoption of deep learning for both text and image processing. Our work is actively being integrated into multiple Microsoft products including Bing, Office, and Xbox. 
If you are excited about cutting-edge deep learning research and applications in NLP, or want to learn more, please see our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/careers.microsoft.com\/us\/en\/search-results?keywords=%23msturingjobs\">careers page<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<p>If you would like to nominate your organization for a private preview of Semantic Search by Project Turing, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/forms.microsoft.com\/Pages\/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR7r1NVU5s6lGuA81eKEudjNUMFRLUFJOQUVHUDlORkZYQkg1RDVDOFQ2TS4u\">submit here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This figure was adapted from a similar image published in DistilBERT. Turing Natural Language Generation (T-NLG) is a 17 billion parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks. 
We present a demo of the model, including its freeform generation, question answering, and summarization capabilities, to academics [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":636135,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Corby Rosset","user_id":"38922"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[194467],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-635250","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artifical-intelligence","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[678390,649749],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a-960x540.png\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a-300x169.png 300w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a-1280x720.png 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/02\/TurningNGL_Model__1400x788-5e418cff76a2a.png 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Corby Rosset","formattedDate":"February 13, 2020","formattedExcerpt":"This figure was adapted from a similar image published in DistilBERT. Turing Natural Language Generation (T-NLG) is a 17 billion parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks. 
We present a demo of the model, including&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/635250","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=635250"}],"version-history":[{"count":89,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/635250\/revisions"}],"predecessor-version":[{"id":637104,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/635250\/revisions\/637104"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/636135"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=635250"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=635250"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=635250"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=635250"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=635250"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=635250"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.co
m\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=635250"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=635250"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=635250"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=635250"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=635250"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}