{"id":573408,"date":"2019-03-18T09:00:17","date_gmt":"2019-03-18T16:00:17","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=573408"},"modified":"2019-06-07T11:38:49","modified_gmt":"2019-06-07T18:38:49","slug":"towards-universal-language-embeddings","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/towards-universal-language-embeddings\/","title":{"rendered":"Towards universal language embeddings"},"content":{"rendered":"<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-573774 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-1024x576.png\" alt=\"\" width=\"1024\" height=\"576\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" 
\/><\/a><\/p>\n<p>Language embedding is a process of mapping symbolic natural language text (for example, words, phrases and sentences) to semantic vector representations. This is fundamental to deep learning approaches to natural language understanding (NLU). It is highly desirable to learn language embeddings that are universal to many NLU tasks.<\/p>\n<p>Two popular approaches to learning language embeddings are language model pre-training and multi-task learning (MTL). While the former learns universal language embeddings by leveraging large amounts of unlabeled data, MTL effectively leverages supervised data from many related tasks and benefits from a regularization effect that alleviates overfitting to any specific task, making the learned embeddings universal across tasks.<\/p>\n<p>Researchers at Microsoft have released <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/multi-task-deep-neural-networks-for-natural-language-understanding-2\/\">MT-DNN<\/a>\u2014a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/namisan\/mt-dnn\">Multi-Task Deep Neural Network<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> model for learning universal language embeddings. 
MT-DNN combines the strengths of MTL and language model pre-training of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1810.04805.pdf\">BERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and outperforms BERT on 10 NLU tasks, creating new state-of-the-art results across many popular NLU benchmarks, including General Language Understanding Evaluation (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/gluebenchmark.com\/leaderboard\">GLUE<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>), Stanford Natural Language Inference (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" 
href=\"https:\/\/nlp.stanford.edu\/projects\/snli\/\">SNLI<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>) and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/leaderboard.allenai.org\/scitail\/submissions\/public\">SciTail<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<h3>MT-DNN architecture<\/h3>\n<p>MT-DNN extends the model <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/mltdnn.pdf\">proposed by Microsoft in 2015<\/a> by incorporating a pre-trained bidirectional transformer language model, known as BERT, developed by Google AI. The architecture of the MT-DNN model is illustrated in the following figure. The lower layers are shared across all tasks, while the top layers are task-specific. The input X, either a sentence or a pair of sentences, is first represented as a sequence of embedding vectors, one for each word, in l_1. Then the transformer-based encoder captures the contextual information for each word and generates the shared contextual embedding vectors in l_2. 
Finally, for each task, additional task-specific layers generate task-specific representations, followed by operations necessary for classification, similarity scoring, or relevance ranking. MT-DNN initializes its shared layers using BERT, then refines them via MTL.<\/p>\n<div id=\"attachment_573423\" style=\"width: 836px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Language-Embeddings-image-1.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-573423\" class=\"wp-image-573423 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Language-Embeddings-image-1.png\" alt=\"\" width=\"826\" height=\"531\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Language-Embeddings-image-1.png 826w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Language-Embeddings-image-1-300x193.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Language-Embeddings-image-1-768x494.png 768w\" sizes=\"auto, (max-width: 826px) 100vw, 826px\" \/><\/a><p id=\"caption-attachment-573423\" class=\"wp-caption-text\">MT-DNN architecture<\/p><\/div>\n<h3>Domain Adaptation Results<\/h3>\n<p>One way to evaluate how universal the language embeddings are is to measure how quickly the embeddings can be adapted to a new task, or how many task-specific labels are needed to get a reasonably good result on the new task. More universal embeddings require fewer task-specific labels.<br \/>\nThe authors of the MT-DNN paper compared MT-DNN with BERT in domain adaptation, where both models are adapted to a new task by gradually increasing the size of in-domain data for adaptation. The results on the SNLI and SciTail tasks are presented in the following table and figure. 
With only 0.1% of in-domain data (which amounts to 549 samples in SNLI and 23 samples in SciTail), MT-DNN achieves accuracy above 80%, while BERT\u2019s accuracy is around 50%, demonstrating that the language embeddings learned by MT-DNN are substantially more universal than those of BERT.<\/p>\n<div id=\"attachment_573435\" style=\"width: 720px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Language-Embeddings-image-2.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-573435\" class=\"wp-image-573435 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Language-Embeddings-image-2.png\" alt=\"\" width=\"710\" height=\"265\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Language-Embeddings-image-2.png 710w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Language-Embeddings-image-2-300x112.png 300w\" sizes=\"auto, (max-width: 710px) 100vw, 710px\" \/><\/a><p id=\"caption-attachment-573435\" class=\"wp-caption-text\">MT-DNN accuracy compared with BERT on the SNLI and SciTail datasets.<\/p><\/div>\n<h3>Release news<\/h3>\n<p>Microsoft will release the MT-DNN package to the public at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/namisan\/mt-dnn\">https:\/\/github.com\/namisan\/mt-dnn<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
The release package contains the pre-trained models, the source code, and a README that describes, step by step, how to reproduce the results reported in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1901.11504\">the MT-DNN paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and how to adapt the pre-trained MT-DNN models to new tasks via domain adaptation. We welcome your comments and feedback and look forward to future developments!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Language embedding is a process of mapping symbolic natural language text (for example, words, phrases and sentences) to semantic vector representations. This is fundamental to deep learning approaches to natural language understanding (NLU). It is highly desirable to learn language embeddings that are universal to many NLU tasks. Two popular approaches to learning language embeddings [&hellip;]<\/p>\n","protected":false},"author":38022,"featured_media":573774,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Jianfeng 
Gao","user_id":"32246"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[194456],"tags":[],"research-area":[13545],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-573408","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-natural-language-processing","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144736,144931],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"display_name":"Jianfeng Gao","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\" aria-label=\"Visit the profile page for Jianfeng Gao\">Jianfeng Gao<\/a>","is_active":false,"last_first":"Gao, Jianfeng","people_section":0,"alias":"jfgao"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788.png\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788.png 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-768x432.png 768w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/03\/Toward-Universal-Languages_Site_03_2019_1400x788-343x193.png 343w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\" title=\"Go to researcher profile for Jianfeng Gao\" aria-label=\"Go to researcher profile for Jianfeng Gao\" data-bi-type=\"byline author\" data-bi-cN=\"Jianfeng Gao\">Jianfeng Gao<\/a>","formattedDate":"March 18, 2019","formattedExcerpt":"Language embedding is a process of mapping symbolic natural language text (for example, words, phrases and sentences) to semantic vector representations. This is fundamental to deep learning approaches to natural language understanding (NLU). 
It is highly desirable to learn language embeddings that are universal to&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/573408","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38022"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=573408"}],"version-history":[{"count":8,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/573408\/revisions"}],"predecessor-version":[{"id":591613,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/573408\/revisions\/591613"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/573774"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=573408"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=573408"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=573408"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=573408"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=573408"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=573408"},{"taxonomy":"msr-locale","embeddable":true,"hr
ef":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=573408"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=573408"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=573408"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=573408"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=573408"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}