{"id":709066,"date":"2020-12-02T11:08:35","date_gmt":"2020-12-02T19:08:35","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=709066"},"modified":"2023-10-13T11:32:30","modified_gmt":"2023-10-13T18:32:30","slug":"a-microsoft-custom-data-type-for-efficient-inference","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/a-microsoft-custom-data-type-for-efficient-inference\/","title":{"rendered":"A Microsoft custom data type for efficient inference"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large is-style-default gifa11y-ignore\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_NoLogo.gif\" alt=\"animated GIF of a data plot chart showing Accuracy % on the y-axis and Normalized MAC Area on the x-axis\"\/><\/figure>\n\n\n\n<p>AI is taking on an increasingly important role in many Microsoft products, such as Bing and Office 365. In some cases, it\u2019s being used to power outward-facing features like semantic search in Microsoft Word or intelligent answers in Bing, and deep neural networks (DNNs) are one key to powering these features. One aspect of DNNs is inference\u2014once these networks are trained, they use inference to make judgments about unknown information based on prior learning.<\/p>\n\n\n\n<p>In Bing, for example, DNN inference enables multiple search scenarios including feature extraction, captioning, question answering, and ranking, which are all important tasks for customers to get accurate, fast responses to their search queries. These scenarios in Bing have stringent latency requirements and need to happen at an extremely large scale. 
As these inference scenarios and associated models grow in complexity and scale, the resources needed for inference are growing too\u2014by 3\u20135 times per year (see Figure 1).<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"940\" height=\"415\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure1.jpg\" alt=\"A scatter plot shows state-of-the-art model growth of inference needed over time. Inferences are measured in the billions of operations per second. Alexnet, 2012, just above 1 B OPS. Resnet-50, 2015, just under 10 B OPS. BERT-Large, 2018, between 100 and 1000 B OPS. Megatron and GPT-2, 2019, between 1000 and 10000 B OPS. T-NLG, 2020, between 10000 and 100000 B OPS. GPT-3, 2020, near 1000000 B OPS. GPT-3 is 90,000x Resnet-50\" class=\"wp-image-709105\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure1.jpg 940w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure1-300x132.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure1-768x339.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure1-16x7.jpg 16w\" sizes=\"auto, (max-width: 940px) 100vw, 940px\" \/><figcaption class=\"wp-element-caption\">Figure 1: State-of-the-art models\u2019 growth in size and cost, shown above as the DNN inference operations needed for models from 2011 until now. 
In the five years since ResNet-50, models have shown exponential growth in inference operations.<\/figcaption><\/figure>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--right\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">PUBLICATION<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point\/\" data-bi-cN=\"Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>This growth trend line underscores the importance of minimizing the cost of inference for Microsoft. Data types\u2014that is, the formats used to represent data\u2014are a key factor in the cost of storage, access, and processing of the large quantities of data involved in deep learning models. In this blog post, we present a brief introduction to Microsoft Floating Point (MSFP), a new class of data types optimized for efficient DNN inferencing, and how it is used in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/project-brainwave\/\">Project Brainwave<\/a> to provide low-cost inference at production scale. 
Project Brainwave and its technology play a critical role in powering the infrastructure for <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/ai-at-scale\/\">AI at Scale<\/a>, the Microsoft initiative to incorporate next-generation AI into its products and AI platforms. Our latest work is detailed in a paper accepted at the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/event\/neurips-2020\/\">34th Conference on Neural Information Processing Systems (NeurIPS 2020)<\/a>, titled \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point\/\">Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point<\/a>.\u201d<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--left\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/event\/neurips-2020\/\" target=\"_self\" aria-label=\"Microsoft at NeurIPS 2020\" data-bi-type=\"annotated-link\" data-bi-cN=\"Microsoft at NeurIPS 2020\" class=\"annotations__list-thumbnail\" >\n\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"172\" height=\"96\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/11\/1920x720_Event_Page_Banner-343x193.png\" class=\"mb-2\" alt=\"illustrated icons related to artificial intelligence for Microsoft's involvement at NeurIPS 2020\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/11\/1920x720_Event_Page_Banner-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/11\/1920x720_Event_Page_Banner-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/11\/1920x720_Event_Page_Banner-655x368.png 655w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/11\/1920x720_Event_Page_Banner-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/11\/1920x720_Event_Page_Banner-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/11\/1920x720_Event_Page_Banner-1280x720.png 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/11\/1920x720_Event_Page_Banner-1920x1080.png 1920w\" sizes=\"auto, (max-width: 172px) 100vw, 172px\" \/>\t\t\t\t<\/a>\n\t\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Event<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/event\/neurips-2020\/\" data-bi-cN=\"Microsoft at NeurIPS 2020\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Microsoft at NeurIPS 2020<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t\t\t<p class=\"annotations__caption text-neutral-400 mt-2\">Check out Microsoft's presence at NeurIPS 2020, including links to all of our NeurIPS publications, the Microsoft session schedule, and links to open career opportunities. <\/p>\n\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"microsoft-floating-point-provides-higher-accuracy-with-low-cost\">Microsoft Floating Point provides higher accuracy with low cost<\/h2>\n\n\n\n<p>Choosing how to represent numeric values in a computing platform is a trade-off between accuracy and cost. Using fewer bits per value, or moving from a floating-point to a fixed-point or integer format, decreases storage and computation costs but typically results in lower accuracy, that is, lower quality of results. 
MSFP is a new class of data types that deliver a fundamental shift in this trade-off, providing higher accuracy using fewer bits per value than traditional floating-point or integer formats. In particular, MSFP enables dot product operations\u2014the core of the matrix-matrix and matrix-vector multiplication operators critical to DNN inference\u2014to be performed nearly as efficiently as with integer data types, but with accuracy comparable to floating point.<\/p>\n\n\n\n<p>Floating-point formats capture a large dynamic range by using an exponent field to scale the significance of the value captured by the mantissa bits. Unfortunately, this exponent field requires additional bits to be stored with each value, and greatly complicates the process of adding or multiplying values. Eliminating the exponent, resulting in a fixed-point or integer representation, reduces computation and storage costs but at a potentially significant reduction in accuracy due to the loss of dynamic range. MSFP delivers the dynamic range of floating point at close to the cost of integer formats by associating a single exponent with a group, or \u201cbounding box\u201d, of mantissa values. MSFP is a class of data types parameterized by the number of mantissa and exponent bits and the bounding box size (see Figure 2). 
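The bounding-box idea above can be made concrete with a short sketch. The following Python function is a toy illustration under stated assumptions, not the production implementation: the function name, the 4-bit mantissa default, and the box size of 16 are chosen for the sketch, and the exact bit allocations of the MSFP variants are given in the paper.

```python
import numpy as np

def msfp_quantize(values, mantissa_bits=4, box_size=16):
    # Toy sketch of MSFP-style bounding-box quantization: each group of
    # box_size values shares one exponent, chosen from the largest
    # magnitude in the group; every value then keeps only a sign and
    # mantissa_bits of mantissa at that shared scale.
    values = np.asarray(values, dtype=np.float64)
    out = np.zeros_like(values)
    limit = 2 ** mantissa_bits - 1  # largest storable mantissa magnitude
    for start in range(0, len(values), box_size):
        box = values[start:start + box_size]
        max_mag = np.max(np.abs(box))
        if max_mag == 0.0:
            continue  # an all-zero box stays zero
        # Shared exponent scales the box so its largest value fills the mantissa.
        shared_exp = np.floor(np.log2(max_mag))
        scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
        # Per-element work is just fixed-point rounding at the shared scale.
        mantissas = np.clip(np.round(box / scale), -limit, limit)
        out[start:start + box_size] = mantissas * scale
    return out
```

Shrinking mantissa_bits trades accuracy for storage, while the cost of the shared exponent is amortized over the box: for instance, if the shared exponent is 8 bits and the box holds 16 values, the exponent adds only half a bit per value, consistent with the memory-density row of Table 1 below.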
In this post, we\u2019ll focus on two versions of the data type, MSFP12 and MSFP16, which represent two points in the cost\/performance trade-off spectrum for MSFP types.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure2.jpg\" alt=\"MSFP uses a shared exponent to achieve the dynamic range of floating-point formats such as fp32 and bfloat16 while keeping storage and computation costs close to those of integer formats.\" class=\"wp-image-709108\" style=\"width:444px;height:462px\" width=\"444\" height=\"462\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure2.jpg 666w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure2-287x300.jpg 287w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure2-11x12.jpg 11w\" sizes=\"auto, (max-width: 444px) 100vw, 444px\" \/><figcaption class=\"wp-element-caption\">Figure 2: MSFP uses a shared exponent to achieve the dynamic range of floating-point formats such as fp32 and bfloat16 while keeping storage and computation costs close to those of integer formats.<\/figcaption><\/figure>\n\n\n\n<p>Through the coevolution of hardware design and algorithms, MSFP16 achieves 3x lower cost compared to bfloat16, and MSFP12 achieves 4x lower cost compared to industry-standard INT8 while delivering comparable or better accuracy. Figure 3 shows a high-level overview of a systolic tensor core architecture containing multiple MSFP dot product units. Each dot product unit has a significantly lower circuit footprint than a conventional floating-point unit due to the shared exponent. 
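To see why the shared exponent shrinks the dot product circuit, the datapath of Figure 3 can be sketched in software. This is a simplified model under assumptions made for illustration (the function name and the lists-of-integers representation are hypothetical; the real hardware is a systolic array and maintains the bounding-box exponents itself): the inner loop is pure integer arithmetic, and the exponents are consulted once per bounding box rather than once per element.

```python
def msfp_dot(mant_a, exp_a, mant_b, exp_b, box_size=16):
    # Toy sketch of an MSFP dot product unit: mant_a and mant_b hold the
    # stored integer mantissas, exp_a and exp_b hold one shared exponent
    # per bounding box of box_size elements.
    acc = 0.0
    for start in range(0, len(mant_a), box_size):
        partial = 0
        for ma, mb in zip(mant_a[start:start + box_size],
                          mant_b[start:start + box_size]):
            partial += ma * mb  # integer multiply-accumulate, no exponent logic
        box = start // box_size
        # One scaling per box amortizes the exponent cost over box_size elements.
        acc += partial * 2.0 ** (exp_a[box] + exp_b[box])
    return acc
```

Replacing per-element floating-point alignment with integer multiply-accumulate in this way is what lets an MSFP dot product unit approach the circuit footprint of an integer one.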
We refer to the granularity at which the exponent is being shared between tensor elements as \u201cbounding box size.\u201d The math per bounding box is mostly performed in fixed-point format and the cost of dynamic scaling is amortized over the number of elements in each bounding box. To hide this complexity from software, the bounding box exponents are computed and updated in real time in hardware.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"304\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure-3-1024x304.jpg\" alt=\"Figure 3: Systolic tensor core architecture containing multiple MSFP dot product units.  The multipliers and adders inside the dot product unit, shown on the right, operate on simple fixed-point values within a common bounding box. Only the accumulator on the far right deals with the more complex scaling issues associated with floating-point computation.\" class=\"wp-image-709111\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure-3-1024x304.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure-3-300x89.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure-3-768x228.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure-3-16x5.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure-3-1066x318.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure-3.jpg 1072w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 3: Systolic tensor core architecture containing multiple MSFP dot product units. 
The multipliers and adders inside the dot product unit, shown on the right, operate on simple fixed-point values within a common bounding box. Only the accumulator on the far right deals with the more complex scaling issues associated with floating-point computation.<\/figcaption><\/figure>\n\n\n\n<p>Variants of MSFP together form a new Pareto frontier for computational performance\/mm<sup>2<\/sup> compared to a collection of competitive data types on commodity hardware. Figure 4 shows the accuracy\u2013area trade-off of using different data types for serving ResNet50 (ImageNet). MSFP outperforms existing data types in terms of area and energy cost while the model is held to a fixed accuracy.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"356\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure4.jpg\" alt=\"Figure 4: MSFP significantly improves upon previous data types in computational efficiency at each fixed level of accuracy. Left: relative area and energy cost of multiply-accumulate (MAC) using different data types on the same silicon. Right: ImageNet accuracy for ResNet-50 plotted versus normalized area cost. 
The area costs are normalized to Float32.\" class=\"wp-image-709114\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure4.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure4-300x104.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure4-768x267.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure4-16x6.jpg 16w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 4: MSFP significantly improves upon previous data types in computational efficiency at each fixed level of accuracy. Left: relative area and energy cost of multiply-accumulate (MAC) using different data types on the same silicon. Right: ImageNet accuracy for ResNet-50 plotted versus normalized area cost. The area costs are normalized to Float32.<\/figcaption><\/figure>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--right\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">BLOG<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-turing-universal-language-representation-model-t-ulrv2-tops-xtreme-leaderboard\/\" data-bi-cN=\"Microsoft Turing Universal Language Representation model, T-ULRv2, tops XTREME leaderboard\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Microsoft Turing Universal Language Representation model, T-ULRv2, tops XTREME leaderboard<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" 
aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>MSFP is integrated within a mature large-scale production pipeline and has been used to ship various models empowering <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-turing-universal-language-representation-model-t-ulrv2-tops-xtreme-leaderboard\/\">major online scenarios such as web search, question-answering, and image classification<\/a>. We further corroborated the efficacy of MSFP data type for inferencing various open-source benchmarks from different classes of deep learning models including CNNs, RNNs, and Transformers. Please refer to Table 1 below and the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point\/\">paper <\/a>for more details.<\/p>\n\n\n\n<table style=\"border-spacing: inherit;border-collapse: collapse;width: 100%;padding: 10px;text-align: center;font-family:Segoe UI; border: 1px solid #000000\"><thead><th style=\"color:white; background-color:#0078d7; padding: 10px;\"><strong>Models<\/strong><\/th><th style=\"color:white; background-color:#0078d7; padding: 10px;\"><strong>Float32<\/strong><th style=\"color:white; background-color:#0078d7; padding: 10px;\"><strong>MSFP16<\/strong><\/th><th style=\"color:white; background-color:#0078d7; padding: 10px;\"><strong>MSFP15<\/strong><\/th><th style=\"color:white; background-color:#0078d7; padding: 10px;\"><strong>MSFP14<\/strong><\/th><th style=\"color:white; background-color:#0078d7; padding: 10px;\"><strong>MSFP13<\/strong><\/th><th style=\"color:white; background-color:#0078d7; padding: 10px;\"><strong>MSFP12<\/strong><\/th><\/tr><\/thead><tbody><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">Resnet-50<\/td><td style=\"padding: 10px;\">1.000 (75.26)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">0.999<\/td><td style=\"padding: 
10px;\"><strong>0.994<\/strong><\/td><td style=\"padding: 10px;\">0.989<\/td><td style=\"padding: 10px;\">0.967<\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">Resnet-101<\/td><td style=\"padding: 10px;\">1.000 (76.21)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">0.998<\/td><td style=\"padding: 10px;\"><strong>0.991<\/strong><\/td><td style=\"padding: 10px;\">0.964<\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">Resnet-152<\/td><td style=\"padding: 10px;\">1.000 (76.58) <\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">1.001<\/td><td style=\"padding: 10px;\">0.997<\/td><td style=\"padding: 10px;\"><strong>0.991<\/strong><\/td><td style=\"padding: 10px;\">0.968<\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">Inception-v3<\/td><td style=\"padding: 10px;\">1.000 (77.98)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">1.005<\/td><td style=\"padding: 10px;\">1.001<\/td><td style=\"padding: 10px;\"><strong>0.990<\/strong><\/td><td style=\"padding: 10px;\">0.943<\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">Inception-v4<\/td><td style=\"padding: 10px;\">1.000 (80.18)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">1.001<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\"><strong>0.993<\/strong><\/td><td style=\"padding: 10px;\">0.963<\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">MobileNet-V1<\/td><td style=\"padding: 10px;\">1.000 (70.90) <\/td><td style=\"padding: 10px;\">0.998<\/td><td style=\"padding: 10px;\">0.997<\/td><td style=\"padding: 10px;\"><strong>0.990<\/strong><\/td><td style=\"padding: 10px;\">0.965<\/td><td style=\"padding: 10px;\">0.863<\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">VGG16<\/td><td 
style=\"padding: 10px;\">1.000 (70.93)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">1.004<\/td><td style=\"padding: 10px;\">1.005<\/td><td style=\"padding: 10px;\">1.003<\/td><td style=\"padding: 10px;\"><strong>1.002<\/strong><\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">VGG19<\/td><td style=\"padding: 10px;\">1.000 (71.02)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">1.002<\/td><td style=\"padding: 10px;\">1.001<\/td><td style=\"padding: 10px;\">1.002<\/td><td style=\"padding: 10px;\"><strong>1.000<\/strong><\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">EfficientNet-S<\/td><td style=\"padding: 10px;\">1.000 (77.61)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">0.998<\/td><td style=\"padding: 10px;\"><strong>0.992<\/strong><\/td><td style=\"padding: 10px;\">0.979<\/td><td style=\"padding: 10px;\">0.949<\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">EfficientNet-M<\/td><td style=\"padding: 10px;\">1.000 (78.98)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">0.998<\/td><td style=\"padding: 10px;\"><strong>0.993<\/strong><\/td><td style=\"padding: 10px;\">0.980<\/td><td style=\"padding: 10px;\">0.950<\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">EfficientNet-L<\/td><td style=\"padding: 10px;\">1.000 (80.47)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">0.999<\/td><td style=\"padding: 10px;\"><strong>0.993<\/strong><\/td><td style=\"padding: 10px;\">0.974<\/td><td style=\"padding: 10px;\">0.945<\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">RNN-DR<\/td><td style=\"padding: 10px;\">1.000 (76.10) <\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">1.008<\/td><td style=\"padding: 10px;\">1.003<\/td><td style=\"padding: 10px;\">1.009<\/td><td 
style=\"padding: 10px;\"><strong>1.000<\/strong><\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">RNN-DS<\/td><td style=\"padding: 10px;\">1.000 (73.10)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">1.012<\/td><td style=\"padding: 10px;\">1.005<\/td><td style=\"padding: 10px;\">1.022<\/td><td style=\"padding: 10px;\"><strong>0.992<\/strong><\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">BERT-MRPC<\/td><td style=\"padding: 10px;\">1.000 (88.39)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">1.005<\/td><td style=\"padding: 10px;\">1.002<\/td><td style=\"padding: 10px;\">1.008<\/td><td style=\"padding: 10px;\"><strong>1.018<\/strong><\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">BERT-SQuAD1.1<\/td><td style=\"padding: 10px;\">1.000 (88.45)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">0.998<\/td><td style=\"padding: 10px;\">0.998<\/td><td style=\"padding: 10px;\">0.997<\/td><td style=\"padding: 10px;\"><strong>0.990<\/strong><\/td><\/tr><tr style=\"border: 1px solid #000000\"><td style=\"padding: 10px;\">BERT-SQuADv2<\/td><td style=\"padding: 10px;\">1.000 (77.23)<\/td><td style=\"padding: 10px;\">1.000<\/td><td style=\"padding: 10px;\">0.999<\/td><td style=\"padding: 10px;\">0.999<\/td><td style=\"padding: 10px;\"><strong>0.993<\/strong><\/td><td style=\"padding: 10px;\">0.989<\/td><\/tr><tr style=\"background-color:#d3d3d3;border: 1px solid #000000\"><td style=\"padding: 10px;\">Memory density<\/td><td style=\"padding: 10px;\">1.0x<\/td><td style=\"padding: 10px;\">3.8x<\/td><td style=\"padding: 10px;\">4.3x<\/td><td style=\"padding: 10px;\">4.9x<\/td><td style=\"padding: 10px;\">5.8x<\/td><td style=\"padding: 10px;\">7.1x<\/td><\/tr><tr style=\"background-color:#d3d3d3;border: 1px solid #000000\"><td style=\"padding: 10px;\">Arithmetic density<\/td><td style=\"padding: 
10px;\">1.0x<\/td><td style=\"padding: 10px;\">8.8x<\/td><td style=\"padding: 10px;\">10.8x<\/td><td style=\"padding: 10px;\">13.9x<\/td><td style=\"padding: 10px;\">18.3x<\/td><td style=\"padding: 10px;\">31.9x<\/td><\/tr><\/tbody><\/table><figcaption>Table 1: Normalized accuracy of different benchmarks across a range of bit-widths. The values are normalized with respect to the Float32 model accuracy listed in column 2 in parentheses. Configurations with the lowest bit-width that stay within 1% of Float32 accuracy are shown in bold.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"project-brainwave-takes-msfp-to-production\">Project Brainwave takes MSFP to production<\/h2>\n\n\n\n<p>Microsoft leverages the benefits of MSFP via Project Brainwave, an internally developed system for real-time production-scale DNN inference in the cloud. Project Brainwave is used heavily in Bing and for our Office 365 workloads. Figure 5 shows a high-level view of the Project Brainwave Neural Processing Unit (NPU) architecture. The most critical part of the NPU is the matrix-vector multiplier (MVM) shown in the upper left. 
The MVM unit and associated matrix register file (MRF) and vector register file (VRF) are designed from the ground up to exploit the MSFP format.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/a-configurable-cloud-scale-dnn-processor-for-real-time-ai\/\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure5-1024x421.jpg\" alt=\"diagram\" class=\"wp-image-709120\" style=\"width:863px;height:355px\" width=\"863\" height=\"355\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure5-1024x421.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure5-300x123.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure5-768x316.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure5-16x7.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/Brainwave_figure5.jpg 1218w\" sizes=\"auto, (max-width: 863px) 100vw, 863px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 5: Project Brainwave NPU Architecture. The thicker arrows represent the datapath connections, which carry data between the MSFP-based matrix-vector multiplier (MVM) and other function units. Data is routed via crossbar (XBAR) units. Firmware running on the control processor issues commands to the instruction dispatcher, which directs the crossbars and function units via control signals (light arrows). 
For more details see \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/a-configurable-cloud-scale-dnn-processor-for-real-time-ai\/\">A Configurable Cloud-Scale DNN Processor for Real-Time AI.<\/a>\u201d<\/figcaption><\/figure>\n\n\n\n<p>Project Brainwave builds on the field-programmable gate arrays (FPGAs) deployed in Microsoft data centers as part of <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/project-catapult\/#:~:text=Project%20Catapult%201%20Project%20Brainwave%20leverages%20Project%20Catapult,The%20first%20hyperscale%20supercomputer.%20...%20More%20items...%20\">Project Catapult<\/a>. The fine-grain bit-level configurability of FPGAs allows us to harvest the increased computational, storage, and bandwidth efficiencies of each incremental reduction in the bit width of the values on which we operate, which is critical for taking advantage of MSFP. FPGA flexibility has also enabled us to experiment and push the boundaries of low-precision computation for DNN inference. We were able to deploy MSFP to production at low risk because the parameters of our MSFP implementation can be adjusted simply by re-synthesizing our FPGA configuration.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"hardening-msfp-in-custom-silicon\">Hardening MSFP in custom silicon<\/h2>\n\n\n\n<p>Now that MSFP has been proven in large-scale production systems, a natural next step is to consider hardening the performance-critical logic into custom silicon. Working together with our partners in the Intel Programmable Solutions Group (PSG), we\u2019ve delivered significant improvement in area and energy efficiency through silicon hardening\u2014resulting in the industry\u2019s first AI-accelerated FPGA using MSFP tensor cores. 
The Intel Stratix 10 NX device is the first generation of its kind, embedding Microsoft algorithms in hardened silicon while continuing to harness the advantages of flexible, reconfigurable hardware.<\/p>\n\n\n\n<p>In the Stratix 10 NX device, over 4,000 custom digital signal processing (DSP) blocks deliver over 250 peak teraflops of MSFP compute. These hardened units are uniquely architected to blend seamlessly with reconfigurable logic, supporting flexible dataflows that can be fine-tuned for target applications\u2014for example, the ability to scale tensor cores up and down while supporting varying degrees of sparsity. Given the rapid change in state-of-the-art inference algorithms, the NX enables Microsoft to continue evolving its hardware platform to handle and accelerate best-of-breed AI algorithms (without the drawbacks of multi-year silicon turnarounds).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"project-brainwave-gains-momentum-for-next-gen-technologies\">Project Brainwave gains momentum for next-gen technologies<\/h2>\n\n\n\n<p>MSFP represents a major turning point in the journey that started in 2016 to deliver large-scale inferencing capabilities for Microsoft. Our understanding and research of low-precision arithmetic have advanced by leaps and bounds, with our confidence strengthened by the success of dozens of state-of-the-art AI models that have been deployed for critical scenarios in Bing and Office 365 using MSFP.<\/p>\n\n\n\n<p>Project Brainwave, built on reconfigurable logic, has provided the critical and flexible hardware platform to evolve algorithms with confidence. As research in low precision and sparsity continues to advance, we expect that Project Brainwave\u2014turbocharged by hardened MSFP\u2014will continue to play a pivotal role in enabling us to deliver the next wave of advantages through algorithm codesign in hardware. 
We\u2019re extremely excited for what\u2019s coming next.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"acknowledgments\">Acknowledgments<\/h3>\n\n\n\n<p>Microsoft Floating Point, and this work specifically, would not be possible without the broad team of engineers and researchers involved. We\u2019d particularly like to thank <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/dlo\/\">Daniel Lo<\/a> and Ritchie Zhao for their critical contributions, as well as <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/mliu\/\">Ming Liu<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfowers\/\">Jeremy Fowers<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/kaovt\/\">Kalin Ovtcharov<\/a>, Anya Vinogradsky, Sarah Massengill, Lita Yang, Ray Bittner, Alessandro Forin, Haishan Zhu, Taesik Na, Prerak Patel, Shuai Che, Ahmad El Husseini, Lok Chand Koppaka, Xia Song, Subhojit Som, and Kaustav Das. We also thank Maximilian Golub, Rasoul Shafipour, and Greg Yang for their feedback on the draft of the paper. We thank Raja Venugopal for his support of this work.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI is taking on an increasingly important role in many Microsoft products, such as Bing and Office 365. In some cases, it\u2019s being used to power outward-facing features like semantic search in Microsoft Word or intelligent answers in Bing, and deep neural networks (DNNs) are one key to powering these features. 
One aspect of DNNs [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":709474,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Bita Darvish Rouhani","user_id":"37839"},{"type":"user_nicename","value":"Doug Burger","user_id":"31582"},{"type":"user_nicename","value":"Eric Chung","user_id":"31746"},{"type":"user_nicename","value":"Rangan Majumder","user_id":"38931"},{"type":"user_nicename","value":"Sangeetha Shekar","user_id":"39907"},{"type":"user_nicename","value":"Saurabh Tiwary","user_id":"39603"},{"type":"user_nicename","value":"Sitaram Lanka","user_id":"37485"},{"type":"user_nicename","value":"Steve Reinhardt","user_id":"37488"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13552,13545],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-709066","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-hardware-devices","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[649749,486102,171431],"related-events":[708199],"related-researchers":[{"type":"user_nicename","value":"Doug Burger","user_id":31582,"display_name":"Doug Burger","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/dburger\/\" 
aria-label=\"Visit the profile page for Doug Burger\">Doug Burger<\/a>","is_active":false,"last_first":"Burger, Doug","people_section":0,"alias":"dburger"},{"type":"user_nicename","value":"Rangan Majumder","user_id":38931,"display_name":"Rangan Majumder","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ranganm\/\" aria-label=\"Visit the profile page for Rangan Majumder\">Rangan Majumder<\/a>","is_active":false,"last_first":"Majumder, Rangan","people_section":0,"alias":"ranganm"},{"type":"user_nicename","value":"Sangeetha Shekar","user_id":39907,"display_name":"Sangeetha Shekar","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/sshekar\/\" aria-label=\"Visit the profile page for Sangeetha Shekar\">Sangeetha Shekar<\/a>","is_active":false,"last_first":"Shekar, Sangeetha","people_section":0,"alias":"sshekar"},{"type":"user_nicename","value":"Sitaram Lanka","user_id":37485,"display_name":"Sitaram Lanka","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/slanka\/\" aria-label=\"Visit the profile page for Sitaram Lanka\">Sitaram Lanka<\/a>","is_active":false,"last_first":"Lanka, Sitaram","people_section":0,"alias":"slanka"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-960x540.jpg\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-1024x576.jpg 1024w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-16x9.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/1400x788_Brainwave_still_nologo-1-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"December 2, 2020","formattedExcerpt":"AI is taking on an increasingly important role in many Microsoft products, such as Bing and Office 365. 
In some cases, it\u2019s being used to power outward-facing features like semantic search in Microsoft Word or intelligent answers in Bing, and deep neural networks (DNNs) are&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/709066","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=709066"}],"version-history":[{"count":32,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/709066\/revisions"}],"predecessor-version":[{"id":976299,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/709066\/revisions\/976299"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/709474"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=709066"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=709066"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=709066"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=709066"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=709066"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\
/research\/wp-json\/wp\/v2\/msr-event-type?post=709066"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=709066"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=709066"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=709066"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=709066"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=709066"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}