{"id":764275,"date":"2021-08-04T10:08:26","date_gmt":"2021-08-04T17:08:26","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=764275"},"modified":"2021-08-04T10:08:28","modified_gmt":"2021-08-04T17:08:28","slug":"make-every-feature-binary-a-135b-parameter-sparse-neural-network-for-massively-improved-search-relevance","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/make-every-feature-binary-a-135b-parameter-sparse-neural-network-for-massively-improved-search-relevance\/","title":{"rendered":"Make Every feature Binary: A 135B parameter sparse neural network for massively improved search relevance"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1024x576.jpg\" alt=\"\" class=\"wp-image-764488\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-343x193.jpg 343w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Recently, Transformer-based deep learning models like GPT-3 have been getting a lot of attention in the machine learning world. These models excel at understanding semantic relationships, and they have contributed to <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus\/\">large improvements in Microsoft Bing\u2019s search experience<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark\/\">surpassing human performance on the SuperGLUE academic benchmark<\/a>. However, these models can fail to capture more nuanced relationships between query and document terms beyond pure semantics.<\/p>\n\n\n\n<p>In this blog post, we are introducing \u201cMake Every feature Binary\u201d (MEB), a large-scale sparse model that complements our production Transformer models to improve search relevance for Microsoft customers using <a href=\"https:\/\/www.microsoft.com\/en-US\/ai\/ai-at-scale\">AI at Scale<\/a>. 
To make search more accurate and dynamic, MEB better harnesses the power of large data and allows for an input feature space with over 200 billion binary features that reflect the subtle relationships between search queries and documents.<\/p>\n\n\n\n<h2 id=\"why-make-every-feature-binary-to-improve-search\">Why \u201cMake Every feature Binary\u201d to improve search?<\/h2>\n\n\n\n<p>One reason MEB works so well as a complement to Transformer-based deep learning models for search relevance is that it can map single facts to features, allowing MEB to gain a more nuanced understanding of individual facts. For example, many deep neural network (DNN) language models might overgeneralize when filling in the blank in this sentence: \u201c(blank) can fly.\u201d Since the majority of DNN training cases result in \u201cbirds can fly,\u201d DNN language models might only fill the blank with the word \u201cbirds.\u201d<\/p>\n\n\n\n<p>MEB avoids this by assigning each fact to a feature, so it can assign weights that distinguish between the ability to fly in, say, a penguin and a puffin. It can do this for each of the characteristics that make a bird\u2014or any entity or object for that matter\u2014singular. Instead of saying \u201cbirds can fly,\u201d MEB paired with Transformer models can take this to another level of classification, saying \u201cbirds can fly, except ostriches, penguins, and these other birds.\u201d<\/p>\n\n\n\n<p>There\u2019s also an element of improving the method for using data more efficiently as scale increases. The ranking of web results in Bing is a machine learning problem that benefits from learning over huge amounts of user data. 
A traditional approach for leveraging click data is to extract thousands of handcrafted numeric features for each impressed query\/document pair and to train a gradient boosted decision tree (GBDT) model.<\/p>\n\n\n\n<p>However, even the state-of-the-art GBDT trainer, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Microsoft\/LightGBM\">LightGBM<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, converges after hundreds of millions of rows of data, due to limited feature representation and model capacity. Also, these handcrafted numeric features tend to be very coarse by nature. For example, they can capture how many times the term at a given position in the query occurs in the document, but information about what the specific term is gets lost in this representation. Additionally, the features in this method don\u2019t always accurately account for things like word order in the search query.<\/p>\n\n\n\n<p>To unlock the power of huge data and enable feature representation that better reflects the relationships between queries and documents, MEB is trained with more than 500 billion query\/document pairs from three years of Bing searches. The input feature space has more than 200 billion binary features. 
With <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.eecs.tufts.edu\/~dsculley\/papers\/ad-click-prediction.pdf\">FTRL<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, the latest version is a sparse neural network model with 9 billion features and over 135 billion parameters.<\/p>\n\n\n\n<h2 id=\"uncovering-hidden-intent-with-the-largest-universal-model-served-at-microsoft\">Uncovering hidden intent with the largest universal model served at Microsoft<\/h2>\n\n\n\n<p>MEB is running in production for 100 percent of Bing searches, in all regions and languages. It is the largest universal model we\u2019re serving at Microsoft, and it demonstrates an excellent ability to memorize facts represented by these binary features while reliably learning from vast amounts of data in a continuous way.<\/p>\n\n\n\n<p>We\u2019ve empirically found that training over this large amount of data is a unique capability of large sparse neural networks. When feeding the same Bing logs to a LightGBM model and training with traditional numeric features, such as BM25 and other kinds of query and document matching features, the model quality no longer improves after one month of data is used. This indicates that the model capacity is not enough to benefit from larger amounts of data. In contrast, MEB is trained on three years of data, and we have found that it continues to learn with more data added, indicating that model capacity increases with newly added data.<\/p>\n\n\n\n<p>When compared to Transformer-based deep learning models, the MEB model also demonstrates interesting capabilities to learn beyond semantic relationships. 
When looking into the top features learned by MEB, we found it can learn hidden intents between query and document.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"389\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig1-1024x389.png\" alt=\"A table shows three columns labeled query term, doc term, and learned relationship. Row 1: Hotmail, Microsoft Outlook, hidden Intent. Row 2: Fox 31, KDVR, hidden Intent. Row 3: baseball, hockey, negative relation. Row 4: Chinese symbol for yoga, Chinese symbol for singing and dancing, negative relation.\" class=\"wp-image-764278\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig1-1024x389.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig1-300x114.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig1-768x292.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig1-240x91.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig1.png 1169w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Table 1: Examples learned by MEB model<\/figcaption><\/figure><\/div>\n\n\n\n<p>For example, MEB learned that \u201cHotmail\u201d is strongly correlated to \u201cMicrosoft Outlook,\u201d even though they are not close to each other in terms of semantic meaning. MEB picks up on a nuanced relationship between these words: Hotmail was a free web-based email service provided by Microsoft that later changed its name to Microsoft Outlook. Similarly, it learned a strong connection between \u201cFox31\u201d and \u201cKDVR,\u201d where KDVR is the call sign of the TV channel in Denver, CO, that\u2019s operating under the brand Fox31. 
Once again, there\u2019s no overt semantic connection between the two phrases.<\/p>\n\n\n\n<p>More interestingly, MEB can identify negative relationships between words or phrases, revealing what users do not want to see for a query. For example, users searching for \u201cbaseball\u201d usually do not click on pages talking about \u201chockey\u201d even though they are both popular sports. The same applies when users search for \u745c\u4f3d (yoga) but do not click on documents containing \u6b4c\u821e (dancing and singing). Understanding these negative relationships can help omit irrelevant search results.<\/p>\n\n\n\n<p>These relationships learned by MEB are very complementary to the ones learned by Transformer-based DNN models. The benefits for search relevance and user experience are very clear. The introduction of MEB on top of our production Transformer models resulted in:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>An almost <strong>2 percent increase in clickthrough rate (CTR) on the top search results.<\/strong> These are the results found \u201cabove the fold,\u201d without the need to scroll down.<\/li><li><strong>A reduction in manual query reformulation by more than 1 percent. <\/strong>Users needing to manually reformulate queries means they didn\u2019t like the results they found with their original query.<\/li><li><strong>A reduction in clicks on pagination by over 1.5 percent. <\/strong>Users needing to click on the \u201cnext page\u201d button means they didn\u2019t find what they were looking for on the first page.<\/li><\/ol>\n\n\n\n<h2 id=\"how-meb-trains-on-data-and-serves-features-at-large-scale\">How MEB trains on data and serves features at large scale<\/h2>\n\n\n\n<h3 id=\"model-structure\">Model Structure<\/h3>\n\n\n\n<p>As illustrated in Figure 1, the MEB model is composed of a binary feature input layer, a feature embedding layer, a pooling layer, and two dense layers. 
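As a rough sketch of this structure, the forward pass can be written in a few lines of NumPy. The 15-dimension embeddings, 49 feature groups, and resulting 735-dimension concatenation follow Figure 1; the table size, hidden width, and ReLU activation here are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 15         # per-feature embedding size (from the post)
NUM_GROUPS = 49      # feature groups; pooled vectors concatenate to 49 * 15 = 735
TABLE_SIZE = 10_000  # toy stand-in for the 9-billion-feature embedding table
HIDDEN = 64          # hidden width of the dense layers (illustrative)

# Toy parameters; in production these are learned from click data.
embedding_table = rng.normal(0, 0.01, (TABLE_SIZE, EMB_DIM))
W1, b1 = rng.normal(0, 0.01, (NUM_GROUPS * EMB_DIM, HIDDEN)), np.zeros(HIDDEN)
W2, b2 = rng.normal(0, 0.01, (HIDDEN, 1)), np.zeros(1)

def click_probability(active):
    """active: dict mapping group id -> hash ids of the binary features set to 1."""
    pooled = []
    for g in range(NUM_GROUPS):
        ids = active.get(g, [])
        if ids:
            # sum-pool the embeddings of all active features in this group
            pooled.append(embedding_table[np.asarray(ids) % TABLE_SIZE].sum(axis=0))
        else:
            pooled.append(np.zeros(EMB_DIM))
    x = np.concatenate(pooled)           # 735-dimension vector
    h = np.maximum(0, x @ W1 + b1)       # dense layer 1 (ReLU assumed)
    logit = (h @ W2 + b2)[0]             # dense layer 2
    return 1.0 / (1.0 + np.exp(-logit))  # estimated click probability

p = click_probability({0: [hash("Q:microsoft windows")], 3: [hash("QueryLength_2")]})
```

The sparsity is what makes the scale workable: only the handful of features active for a given query/document pair are ever looked up, no matter how large the embedding table grows.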
The input layer contains 9 billion features, generated from 49 feature groups, with each binary feature encoded into a 15-dimension embedding vector. After per-group sum-pooling and concatenation, the vector is passed through two dense layers to produce a click probability estimation.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"975\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig2.png\" alt=\"Technical diagram of MEB model. MEB is a sparse neural network model composed of an input layer taking in binary features, a feature embedding layer transforming each binary feature into a 15-dimension vector, a sum pooling layer applied on each of 49 feature groups and concatenated to produce a 735-dimension vector, which is then passed through two dense layers to produce a click probability. Features shown in this figure are generated from the example query \u201cMicrosoft Windows\u201d and document www.microsoft.com\/en-us\/windows.\" class=\"wp-image-764281\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig2.png 975w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig2-300x166.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig2-768x425.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig2-240x133.png 240w\" sizes=\"auto, (max-width: 975px) 100vw, 975px\" \/><figcaption>Figure 1: MEB is a sparse neural network model composed of an input layer taking in binary features, a feature embedding layer transforming each binary feature into a 15-dimension vector, a sum pooling layer applied on each of 49 feature groups and concatenated to produce a 735-dimension vector, which is then passed through two dense layers to produce a click probability. 
Features shown in this figure are generated from the example query \u201cMicrosoft Windows\u201d and document https:\/\/www.microsoft.com\/en-us\/windows, as described in Figure 2.<\/figcaption><\/figure><\/div>\n\n\n\n<h3 id=\"training-data-and-unifying-features-as-binary\">Training data and unifying features as binary<\/h3>\n\n\n\n<p>MEB uses three years of search logs from Bing as training data. For each Bing search impression, we use heuristics to determine if the user was satisfied with the document(s) they clicked. We label these \u201csatisfactory\u201d documents as positive samples. Other documents in the same impression are labeled as negative samples. For each query and document pair, binary features are extracted from the query text, the document URL, title, and body text. These features are fed into a sparse neural network model to minimize the cross-entropy loss between the model\u2019s predicted click probability and the actual click label.<\/p>\n\n\n\n<p>Feature design and large-scale training are key to the success of MEB. MEB features are defined on the very specific term\u2013level or N-gram\u2013level relationship between query and document, which can&#8217;t be captured by traditional numeric features that only care about the matching count between query and document. (N-grams are a sequence of N terms.) To fully release the power of this large-scale training platform, all the features are designed as binary features, which can easily cover manually crafted numeric features and features directly extracted from raw text in a consistent way. Doing so allows MEB to do end-to-end optimization in one path. The current production model uses three main types of features, which are described below.<\/p>\n\n\n\n<h3 id=\"query-and-document-n-gram-pair-features\">Query and Document N-gram pair features<\/h3>\n\n\n\n<p>N-gram pair features are generated based on N-gram combinations of query and document fields from the Bing search logs. 
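A sketch of how these query-and-document N-gram crossings might be enumerated; the tokenization and feature-name format here are illustrative, not the exact production encoding.

```python
from itertools import product

def ngrams(text, n):
    """Return the N-grams (sequences of N terms) of a whitespace-tokenized string."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def ngram_pair_features(query, doc_field, field_name, max_n=2):
    """Cross every query N-gram with every document-field N-gram.

    The production model uses N = 1 and 2 (unigrams and bigrams)."""
    feats = []
    for n in range(1, max_n + 1):
        for q, d in product(ngrams(query, n), ngrams(doc_field, n)):
            feats.append(f"Q:{q}_{field_name}:{d}")
    return feats

feats = ngram_pair_features("Microsoft Windows", "Explore Windows 10", "Title")
# unigram pairs: 2 query terms x 3 title terms = 6; bigram pairs: 1 x 2 = 2
```

Each resulting string is one binary feature: it is 1 for this query/document pair and 0 otherwise, which is why the feature space grows into the hundreds of billions.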
As demonstrated in Figure 2, N-grams from the query text will combine with N-grams from the document URL, title, and body text to form N-gram pair features. Longer N-grams (for higher values of N) are able to capture richer and more nuanced concepts. However, processing them is exponentially more expensive as N increases. In our production model, N is set to 1 and 2 (unigrams and bigrams respectively).<\/p>\n\n\n\n<p>We also generate features by combining the entire query text and document field. For example, the feature \u201cQuery_Title_Microsoft Windows_Explore Windows 10 OS Computer Apps More Microsoft\u201d is a feature generated from query=\u201dMicrosoft Windows\u201d and document title=\u201dExplore Windows 10 OS Computer Apps More Microsoft\u201d.<\/p>\n\n\n\n<h3 id=\"one-hot-encoding-of-bucketized-numeric-features\">One-hot encoding of bucketized numeric features<\/h3>\n\n\n\n<p>Numeric features are transformed into binary format by first bucketizing them and then by applying one-hot encoding. In the example depicted in Figure 2, the numeric feature \u201cQueryLength\u201d can take any integer value between 1 to MaxQueryLength. We define MaxQueryLength buckets for this feature so that query \u201cMicrosoft Windows\u201d has the binary feature QueryLength_2 equal to 1.<\/p>\n\n\n\n<h3 id=\"one-hot-encoding-of-categorical-features\">One-hot encoding of categorical features<\/h3>\n\n\n\n<p>Categorical features can be transformed into binary features through one-hot encoding in a straightforward way. For example, UrlString is a categorical feature with each unique URL string text as a different category.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"974\" height=\"279\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB-Blog_Fig3.png\" alt=\"Two columns. 
On the left is an example query document pair including a query, document title, URL, and snippet as input. On the right are some of the typical features represented by MEB. These include Query and Title feature, Query length feature, URL string feature, and query unigram and title unigram features. Read caption for more specifics. \" class=\"wp-image-764284\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB-Blog_Fig3.png 974w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB-Blog_Fig3-300x86.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB-Blog_Fig3-768x220.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB-Blog_Fig3-240x69.png 240w\" sizes=\"auto, (max-width: 974px) 100vw, 974px\" \/><figcaption>Figure 2: Above is an example of how MEB features look. The left side illustrates an example query document pair, with query text, document title, URL, and snippet as input to feature extraction. The right side demonstrates some typical features produced by MEB. For example, query \u201cMicrosoft Windows\u201d and document title \u201cExplore Windows 10 OS, Computers, Apps, & More | Microsoft\u201d produces a Query x Title feature \u201cQuery:Microsoft Windows_Title:Explore Windows 10 OS Computer Apps More Microsoft.\u201d As query \u201cMicrosoft Windows\u201d contains two terms, a binary feature \u201cQueryLength_2\u201d is produced. 
Each combination of query term and document title term can produce a list of Query unigram x Title unigram features, such as \u201cQTerm:Microsoft_TitleTerm:Explore\u201d and others.<\/figcaption><\/figure><\/div>\n\n\n\n<h3 id=\"continuous-training-supports-a-trillion-query-document-pairs-and-refreshes-daily\">Continuous training supports a trillion query\/document pairs and refreshes daily<\/h3>\n\n\n\n<p>To enable training with such a huge feature space, we leveraged Woodblock, an internal large-scale training platform built by the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/about.ads.microsoft.com\/en-us\/h\/a\/microsoft-advertising?ef_id=a34847914f4c184c15292760a4561a7d:G:s&OCID=AID2200059_SEM_a34847914f4c184c15292760a4561a7d:G:s&s_cid=US-ACQ-PPC-src_BNG-sub_prosp-cat_Brand_mt_b&msclkid=a34847914f4c184c15292760a4561a7d\">Microsoft Advertising<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> team. It is a distributed, large-scale, and high-performance solution for training large sparse models. Built on top of TensorFlow, Woodblock fills the gap between general deep learning frameworks and industrial requirements for billions of sparse features. With deep optimization on I\/O and data processing, it can train hundreds of billions of features within hours using CPU and GPU clusters.<\/p>\n\n\n\n<p>Even with the Woodblock pipeline, training MEB with three years of Bing search logs that contain almost a trillion query\/document pairs would be very hard to do in one shot. Instead, we apply a continuous training approach, with each month of new data continuously training the previous model, trained on top of the previous months of data.<\/p>\n\n\n\n<p>More importantly, even after implementation into Bing, the model is refreshed daily by continuously training with the latest daily click data, as illustrated in Figure 3. 
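The daily refresh with a 500-day feature auto-expiration window can be sketched as follows. The timestamp bookkeeping and function shape are assumptions for illustration, and the gradient updates of continuous training are omitted.

```python
from datetime import date, timedelta

EXPIRY = timedelta(days=500)  # features unseen for this long are dropped

def daily_refresh(feature_last_seen, todays_features, today):
    """One refresh step: record which features appeared in today's click log,
    then drop any feature not seen within the expiry window."""
    for f in todays_features:
        feature_last_seen[f] = today
    return {f: ts for f, ts in feature_last_seen.items() if today - ts <= EXPIRY}

seen = {"QueryLength_2": date(2020, 1, 1), "Q:hotmail_Title:outlook": date(2021, 7, 1)}
seen = daily_refresh(seen, {"Q:hotmail_Title:outlook"}, today=date(2021, 8, 1))
# "QueryLength_2" (last seen 578 days ago) expires; the refreshed feature stays
```

Expiring stale features bounds the model's memory footprint while letting newly observed features claim capacity.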
To avoid the negative impact of stale features, an auto-expiration strategy checks each feature&#8217;s timestamp and filters out features that have not shown up in the last 500 days. After continuous training, the daily deployment of the updated model is fully automated.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"734\" height=\"693\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig4.png\" alt=\"Flowchart illustrating how MEB is refreshed daily. From top to bottom: A blue cylinder is labeled Production MEB model. This points down to green rectangle labeled MEB Continue Train. On the left, a blue cylinder labeled Bing search log latest one day data also points to MEB Continue Train. MEB Continue Train points down to purple diamond labeled \u201cFeature show up in past 500 days?\u201d Answering no points to green rectangle labeled Filter the features from model. Answering yes points to green rectangle labeled produce updated model. Green rectangle Filter the features from model also points to rectangle labeled produce updated model. An arrow points down from produce updated model rectangle and another arrow points back to Production MEB Model.  \" class=\"wp-image-764287\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig4.png 734w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig4-300x283.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig4-191x180.png 191w\" sizes=\"auto, (max-width: 734px) 100vw, 734px\" \/><figcaption>Figure 3: Above is a flowchart illustrating how MEB is refreshed on a daily basis. Production MEB model is continuously trained with the latest single-day Bing search log data every day. 
Stale features not showing up in the past 500 days are removed from the model before the new model is deployed and served online. This keeps the freshness of features and ensures model capacity is efficiently utilized.<\/figcaption><\/figure><\/div>\n\n\n\n<h3 id=\"serving-an-extremely-large-model-using-the-bing-objectstore-platform\">Serving an extremely large model using the Bing ObjectStore platform<\/h3>\n\n\n\n<p>The MEB sparse neural network model occupies 720 GB when loaded into memory. During peak traffic time, the system needs to sustain 35 million feature lookups per second, and so it is not possible to serve MEB from a single machine. 
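A schematic of such sharded serving, with embedding lookups hash-partitioned across data shards; the shard count, in-process dictionaries, and function names are hypothetical stand-ins for a distributed key-value store.

```python
NUM_DATA_SHARDS = 8  # hypothetical; production spreads shards over many machines

def shard_of(feature_hash):
    """Route a feature hash to the data shard that owns its embedding."""
    return feature_hash % NUM_DATA_SHARDS

# Each data shard holds a slice of the embedding table, keyed by feature hash.
data_shards = [dict() for _ in range(NUM_DATA_SHARDS)]

def put(feature_hash, embedding):
    data_shards[shard_of(feature_hash)][feature_hash] = embedding

def batch_lookup(feature_hashes):
    """A compute shard groups lookups by data shard (one round trip per shard
    in a real system) and gathers the embeddings before the dense layers run."""
    by_shard = {}
    for h in feature_hashes:
        by_shard.setdefault(shard_of(h), []).append(h)
    gathered = {}
    for s, hs in by_shard.items():
        gathered.update({h: data_shards[s][h] for h in hs})
    return gathered

put(101, [0.1] * 15)
put(7, [0.2] * 15)
embs = batch_lookup([101, 7])
```

Batching the per-shard lookups keeps the number of network round trips per query bounded by the shard count rather than by the number of active features.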
Instead, we leverage Bing\u2019s homegrown <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/evolution-bings-objectstore\/\">ObjectStore<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> service to host and serve the MEB model.<\/p>\n\n\n\n<p>ObjectStore is a multi-tenant, distributed key-value store supporting both data and compute hosting. The feature embedding layer of MEB is implemented as a table lookup operation in ObjectStore, with each binary feature hash used as a key to retrieve its embedding produced at training time. The pooling and dense layer parts are more compute-heavy and are executed in an ObjectStore Coproc\u2014a near-data compute unit\u2014hosting a user-defined function. MEB separates compute and data serving into different shards. Each compute shard takes a portion of production traffic for neural-network processing, and each data shard hosts a portion of the model data, as shown in Figure 4.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"472\" height=\"551\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig5.png\" alt=\"On left, three boxes vertically stacked are labeled Compute Shards. Each box contains the text Machine NN CoProc, with an arrow circling NN CoProc. There are three dots between the bottom two boxes. On the right, two boxes vertically stacked are labeled Data Shards. 
First box reads Machine and contains two blue cylinders vertically stacked, labeled P1 and P2 respectively. Second box reads Machine and contains two blue cylinders vertically stacked, labeled P_N and P_M respectively. There are two arrows from each of the Compute Shard boxes on the left that point to both Data Shard boxes on the right, six arrows in total. \" class=\"wp-image-764290\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig5.png 472w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig5-257x300.png 257w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/MEB_Blog_Fig5-154x180.png 154w\" sizes=\"auto, (max-width: 472px) 100vw, 472px\" \/><figcaption>Figure 4: The ObjectStore Coproc in each compute shard communicates with data shards to retrieve feature embeddings and run the neural network. Data shards store the feature embedding table and serve lookup requests from each Coproc call.<\/figcaption><\/figure><\/div>\n\n\n\n<p>Since most workloads running on ObjectStore are exclusively doing storage lookups, co-locating the MEB compute shards and in-memory data shards allows us to maximize the ObjectStore compute and memory resources in the multi-tenant cluster. With the shards distributed across many machines, we can also finely control the load on each machine, achieving single-digit-millisecond serving latency for MEB.<\/p>\n\n\n\n<h3 id=\"powering-faster-search-that-understands-content-better\">Powering faster search that understands content better<\/h3>\n\n\n\n<p>We\u2019ve found very large sparse neural networks like MEB can learn nuanced relationships complementary to the capabilities of Transformer-based neural networks. 
This improved understanding of search language results in significant benefits to the entire search ecosystem:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Thanks to improved search relevance, Bing users are able to find content and achieve tasks faster, with a reduced need to reformulate their queries or go beyond page 1.<br><\/li><li>Because MEB understands content better, publishers and webmasters get more traffic to their properties, and they can focus on satisfying their customers instead of spending time finding the right keyword that will help them rank higher. A concrete example is product rebranding, where the MEB model may be able to learn the relationship between the old and new name automatically, much like it did for \u201cHotmail\u201d and \u201cMicrosoft Outlook.\u201d<\/li><\/ul>\n\n\n\n<p>If you are using DNNs to power your business, we recommend experimenting with large sparse neural networks to complement those models. This is especially true if you have a large historical stream of user interactions and can easily construct simple binary features. If you go down this path, we do recommend making sure the model is updated in as near to real-time as possible.<\/p>\n\n\n\n<p>MEB is just one example of how our team is creating impactful cutting-edge technology that advances scale and efficiency for improved search. If you\u2019re interested in large-scale modeling for search and recommendations, the Core Search & AI team is hiring! 
You can find our current openings on the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/careers.microsoft.com\/us\/en\/search-results?keywords=%23semanticsearch%23\">Microsoft Careers website<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently, Transformer-based deep learning models like GPT-3 have been getting a lot of attention in the machine learning world. These models excel at understanding semantic relationships, and they have contributed to large improvements in Microsoft Bing\u2019s search experience (opens in new tab) and surpassing human performance on the SuperGLUE academic benchmark. However, these models can [&hellip;]<\/p>\n","protected":false},"author":40519,"featured_media":764488,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Junyan Chen","user_id":"37332"},{"type":"user_nicename","value":"Fr\u00e9d\u00e9ric Dubut","user_id":"40528"},{"type":"user_nicename","value":"Jason (Zengzhong) Li","user_id":"40543"},{"type":"user_nicename","value":"Rangan 
Majumder","user_id":"38931"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-764275","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[649749],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Junyan Chen","user_id":37332,"display_name":"Junyan Chen","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/junyanch\/\" aria-label=\"Visit the profile page for Junyan Chen\">Junyan Chen<\/a>","is_active":false,"last_first":"Chen, Junyan","people_section":0,"alias":"junyanch"},{"type":"user_nicename","value":"Jason (Zengzhong) Li","user_id":40543,"display_name":"Jason (Zengzhong) Li","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jasol\/\" aria-label=\"Visit the profile page for Jason (Zengzhong) Li\">Jason (Zengzhong) Li<\/a>","is_active":false,"last_first":"Li, Jason (Zengzhong)","people_section":0,"alias":"jasol"},{"type":"user_nicename","value":"Rangan Majumder","user_id":38931,"display_name":"Rangan Majumder","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ranganm\/\" aria-label=\"Visit the profile page for Rangan Majumder\">Rangan Majumder<\/a>","is_active":false,"last_first":"Majumder, Rangan","people_section":0,"alias":"ranganm"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-960x540.jpg\" class=\"img-object-cover\" alt=\"Technical diagram of MEB model. MEB is a sparse neural network model composed of an input layer taking in binary features, a feature embedding layer transforming each binary feature into a 15-dimension vector, a sum pooling layer applied on each of 49 feature groups and concatenated to produce a 735-dimension vector, which is then passed through two dense layers to produce a click probability. Features shown in this figure are generated from the example query \u201cMicrosoft Windows\u201d and document www.microsoft.com\/en-us\/windows.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-240x135.jpg 240w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/08\/1400x788_MEB_no_logo_still-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/junyanch\/\" title=\"Go to researcher profile for Junyan Chen\" aria-label=\"Go to researcher profile for Junyan Chen\" data-bi-type=\"byline author\" data-bi-cN=\"Junyan Chen\">Junyan Chen<\/a>, Fr\u00e9d\u00e9ric Dubut, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jasol\/\" title=\"Go to researcher profile for Jason (Zengzhong) Li\" aria-label=\"Go to researcher profile for Jason (Zengzhong) Li\" data-bi-type=\"byline author\" data-bi-cN=\"Jason (Zengzhong) Li\">Jason (Zengzhong) Li<\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ranganm\/\" title=\"Go to researcher profile for Rangan Majumder\" aria-label=\"Go to researcher profile for Rangan Majumder\" data-bi-type=\"byline author\" data-bi-cN=\"Rangan Majumder\">Rangan Majumder<\/a>","formattedDate":"August 4, 2021","formattedExcerpt":"Recently, Transformer-based deep learning models like GPT-3 have been getting a lot of attention in the machine learning world. 
These models excel at understanding semantic relationships, and they have contributed to large improvements in Microsoft Bing\u2019s search experience (opens in new tab) and surpassing human&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/764275","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/40519"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=764275"}],"version-history":[{"count":6,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/764275\/revisions"}],"predecessor-version":[{"id":764743,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/764275\/revisions\/764743"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/764488"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=764275"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=764275"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=764275"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=764275"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=764275"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microso
ft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=764275"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=764275"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=764275"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=764275"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=764275"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=764275"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}