{"id":800533,"date":"2021-12-06T12:24:23","date_gmt":"2021-12-06T20:24:23","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=800533"},"modified":"2021-12-15T15:25:43","modified_gmt":"2021-12-15T23:25:43","slug":"you-get-what-you-measure-new-nlu-benchmarks-for-few-shot-learning-and-robustness-evaluation","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/you-get-what-you-measure-new-nlu-benchmarks-for-few-shot-learning-and-robustness-evaluation\/","title":{"rendered":"You get what you measure: New NLU benchmarks for few-shot learning and robustness evaluation"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-scaled.jpg\" alt=\"Diagram shows two columns with definitions of Data Efficiency, and Robustness. In the first column, \"Data efficiency\" the definition reads \"CLUES: A standard benchmark for evaluating methods for few-shot learning in NLU\". In the second column, \"Robustness\", it reads \"AdvGLUE: A multitask benchmark to quantitively and thoroughly evaluate the vulnerabilities of modern large-scale models\". \"\/><\/figure>\n\n\n\n<p>Recent progress in natural language understanding (NLU) has been driven in part by the availability of large-scale benchmarks that provide an environment for researchers to test and measure the performance of AI models. 
Most of these benchmarks are designed for academic settings: typically datasets that feature independent and identically distributed (IID) training, validation, and testing sections drawn from data that have been collected or annotated by crowdsourcing.<\/p>\n\n\n\n<p>However, increasing evidence shows that AI models that achieve human-level performance on academic benchmarks may underperform in real-world settings where a) task-specific labels are unavailable for model training and b) the dataset contains various adversarial examples. Ironically, models that reached human-level performance in academic settings highlight the limitations of these benchmarks in reliably evaluating the two capabilities of AI models that, we believe, are crucial for real-world applications:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><strong>Learning efficiency:<\/strong> Most existing benchmarks give models access to large amounts of labeled data for training, far more than people need to achieve strong performance. Such large amounts of human-labeled data are rarely available for most real-world applications.<\/li><li><strong>Robustness:<\/strong> State-of-the-art models remain vulnerable to carefully crafted adversarial examples, which can fool the models into outputting arbitrarily wrong answers by altering input sentences in a way that is unnoticeable to humans.
Real-world systems built upon these vulnerable models can be misled in ways that raise serious security concerns.<\/li><\/ol>\n\n\n\n<p>To address these limitations, Microsoft Research is releasing two new NLU benchmarks, including one created in collaboration with the University of Illinois at Urbana-Champaign, to simulate real-world settings where AI models must adapt to new tasks with few task-specific labels and be robust to changes or adversarial attacks.<\/p>\n\n\n\n<h2 id=\"clues-evaluating-few-shot-learning-in-nlu\">CLUES: Evaluating few-shot learning in NLU<\/h2>\n\n\n\n<p>Despite increasing interest in data-efficient, few-shot learning with language models, no standardized evaluation benchmark exists for few-shot natural language understanding (NLU). As a result, different research studies adopt different experimental settings.<\/p>\n\n\n\n<p>To help accelerate work on few-shot learning for NLU, Microsoft researchers have created CLUES, a benchmark for evaluating the few-shot learning capabilities of NLU models.<\/p>\n\n\n\n<p>CLUES was designed with the following principles in mind:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><strong>Task Selection:<\/strong> tasks should provide high coverage and diversity across NLU task types and must show a significant gap between human and machine performance<\/li><li><strong>Task Formulation:<\/strong> tasks should follow a consistent format to unify different types of tasks and model families to encourage broad usage and adoption<\/li><li><strong>Evaluation:<\/strong> measuring true few-shot learning capabilities requires unified metrics to compare and aggregate model performance across diverse tasks and evaluation settings<\/li><\/ol>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"936\" height=\"391\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Table-1_CLUEs-Blog.png\" alt=\"Table 1.
Examples of labeled instances from tasks in CLUES demonstrating wide coverage across task types and unified task format\" class=\"wp-image-800536\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Table-1_CLUEs-Blog.png 936w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Table-1_CLUEs-Blog-300x125.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Table-1_CLUEs-Blog-768x321.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Table-1_CLUEs-Blog-240x100.png 240w\" sizes=\"auto, (max-width: 936px) 100vw, 936px\" \/><figcaption>Table 1. Examples of labeled instances from tasks in CLUES demonstrating wide coverage across task types and unified task format<\/figcaption><\/figure><\/div>\n\n\n\n<p>In a research paper, \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/CLUES-Few-shot-Benchmark-NeurIPS-2021.pdf\">CLUES: Few-Shot Learning Evaluation in Natural Language Understanding<\/a>\u201d, we show that while recent models reach human performance when they have access to large amounts of task-specific labeled data, a huge gap in performance exists in the few-shot setting for many NLU tasks. In this paper, which has been accepted at the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/event\/neurips-2021\/\">2021 Conference on Neural Information Processing Systems (NeurIPS 2021)<\/a>, we also show differences between alternative model families and adaptation techniques in the few-shot setting.
Finally, we discuss several principles and choices in designing the experimental settings for evaluating true few-shot learning performance and suggest a unified, standardized approach to few-shot learning evaluation.<\/p>\n\n\n\n<p>The <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/CLUES-Few-shot-Benchmark-NeurIPS-2021.pdf\">paper<\/a> <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/pdf?id=VhIIQBm00VI\"><span class=\"sr-only\"> (opens in new tab)<\/span><\/a> also presents other interesting findings, as summarized below:<\/p>\n\n\n\n<p><strong>Evaluation settings: <\/strong>We observe a wide variance in few-shot performance with classic fine-tuning, which is exacerbated by model size, and considerable variance with prompt-based fine-tuning. As a result, we provide, and recommend using, multiple splits of few-shot training examples and reporting the mean and variance on a single test set to measure the robustness and generalizability of different models.<\/p>\n\n\n\n<p><strong>Human performance: <\/strong>We conduct extensive human annotation studies on all tasks to assess human performance in limited labeled data settings. We observe that people can achieve very good performance on all tasks, even when shown only a small number of examples. For some tasks, humans do equally well when shown a short description of the task, even with no labeled examples at all.<\/p>\n\n\n\n<p><strong>Model vs. human performance: <\/strong>In the fully supervised setting, models can match or exceed human performance for many tasks. However, in the few-shot setting, humans outperform the best models, with a huge performance gap.
This gap is even more pronounced for more complex tasks like named entity recognition and machine reading comprehension, where people perform very well with only a few demonstration examples, whereas all models perform close to random.<\/p>\n\n\n\n<p><strong>Model capacity:<\/strong> In the fully supervised setting with adequate training data, the performance of different models generally improves with increasing model size. However, in the few-shot setting, we do not observe any consistent impact of model size on performance with classic fine-tuning for most tasks. Yet for prompt tuning, bigger models tend to perform better.<\/p>\n\n\n\n<p>We hope that CLUES encourages research on NLU models that can generalize to new tasks with a small number of examples. More details are available at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/clues-benchmark\">https:\/\/aka.ms\/clues-benchmark<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and in the upcoming NeurIPS 2021 <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/CLUES-Few-shot-Benchmark-NeurIPS-2021.pdf\">paper<\/a>.<\/p>\n\n\n\n<h2 id=\"advglue-evaluating-the-robustness-of-language-models\">AdvGLUE: Evaluating the robustness of language models<\/h2>\n\n\n\n<p>Despite the tremendous success of large-scale pre-trained language models across a wide range of NLU tasks, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples.
While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing.<\/p>\n\n\n\n<p>To quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks, researchers from Microsoft and the University of Illinois at Urbana-Champaign have created Adversarial GLUE (AdvGLUE), a new multi-task benchmark. This work is detailed in the paper, \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/adversarial-glue-a-multi-task-benchmark-for-robustness-evaluation-of-language-models\/\">Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models<\/a>,\u201d which has been accepted for publication at NeurIPS 2021.<\/p>\n\n\n\n<p>AdvGLUE systematically applies 14 textual adversarial attack methods to GLUE tasks. We then perform an extensive filtering process, including validation by humans, to exclude erroneous or poor-quality examples. This helps produce the reliable annotations necessary to curate a high-quality benchmark. Most existing automated adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples. Around 90% of these machine-generated examples either change the original semantic meaning or mislead both human annotators and models.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure1_CLUES-Figure.png\" alt=\"Figure 1.
Overview of the AdvGLUE benchmark construction pipeline\" class=\"wp-image-800539\" width=\"900\" height=\"290\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure1_CLUES-Figure.png 936w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure1_CLUES-Figure-300x97.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure1_CLUES-Figure-768x248.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure1_CLUES-Figure-240x77.png 240w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><figcaption>Figure 1. Overview of the AdvGLUE benchmark construction pipeline<\/figcaption><\/figure><\/div>\n\n\n\n<p>AdvGLUE is designed to be a unique and valuable asset to the community for improving the robustness of language models, with the following objectives in mind:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Comprehensive coverage: <\/strong>We consider textual adversarial attacks from different perspectives and hierarchies, including word-level transformations, sentence-level manipulations, and human-written adversarial examples.
As a result, we believe AdvGLUE provides substantial coverage of adversarial linguistic phenomena.<\/li><li><strong>Systematic human annotations:<\/strong> Systematic evaluation and annotation of the generated textual adversarial examples can be accomplished through crowdsourcing to identify high-quality adversarial data for reliable evaluation.<\/li><li><strong>General compatibility: <\/strong>To obtain a comprehensive understanding of the robustness of language models across different NLU tasks, AdvGLUE covers the widely used GLUE tasks and creates an adversarial version of the GLUE benchmark.<\/li><li><strong>High transferability and effectiveness:<\/strong> AdvGLUE has high adversarial transferability and can effectively attack a wide range of state-of-the-art models. We observe a significant performance drop for models evaluated on AdvGLUE compared with their standard accuracy on the GLUE leaderboard. For instance, the average GLUE score of ELECTRA (Large) drops from 93.16 to 41.69.<\/li><\/ul>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"550\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure2_DNA-Storage_-Figure-1024x550.png\" alt=\"Performance of SOTA language models on GLUE vs. AdvGLUE\" class=\"wp-image-800542\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure2_DNA-Storage_-Figure-1024x550.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure2_DNA-Storage_-Figure-300x161.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure2_DNA-Storage_-Figure-768x412.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure2_DNA-Storage_-Figure-1536x825.png 1536w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure2_DNA-Storage_-Figure-2048x1100.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure2_DNA-Storage_-Figure-710x380.png 710w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Figure2_DNA-Storage_-Figure-240x129.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 2. Performance of SOTA language models on GLUE vs. AdvGLUE<\/figcaption><\/figure><\/div>\n\n\n\n<p>We hope that AdvGLUE will inspire active research and discussion in the community. More details are available at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/adversarialglue.github.io\">https:\/\/adversarialglue.github.io<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and in the upcoming NeurIPS 2021 <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/pdf?id=GF9cSKI3A_q\">paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<h2 id=\"acknowledgements\">Acknowledgements<\/h2>\n\n\n\n<p>The <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/CLUES-Few-shot-Benchmark-NeurIPS-2021.pdf\">first paper<\/a> in this post was a collaboration across Microsoft researchers, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/submukhe\/\">Subhabrata Mukherjee<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xiaodl\/\">Xiaodong Liu<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/zheng\/\">Guoqing Zheng<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/sahoss\/\">Saghar Hosseini<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/chehao\/\">Hao Cheng<\/a>, <a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/gregyang\/\">Greg Yang<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/chris-meek-4b8332\/\">Chris Meek<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/hassanam\/\">Ahmed Awadallah<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\">Jianfeng Gao<\/a>. <\/p>\n\n\n\n<p>The <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/pdf?id=GF9cSKI3A_q\">second paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> in this post was collaboration across Microsoft researchers Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao and Ahmed H. Awadallah as well as researchers from the University of Illinois at Urbana-Champaign, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/wbx.life\/\">Boxin Wang<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, Chejian Xu and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aisecure.github.io\/\">Bo Li.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recent progress in natural language understanding (NLU) has been driven in part by the availability of large-scale benchmarks that provide an environment for researchers to test and measure the performance of AI models. 
Most of these benchmarks are designed for academic settings&#8211;typically datasets that feature independent and identically distributed (IID) training, validation, and testing sections [&hellip;]<\/p>\n","protected":false},"author":40735,"featured_media":800545,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Jianfeng Gao","user_id":"32246"},{"type":"user_nicename","value":"Ahmed H. Awadallah","user_id":"31979"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-800533","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[761314],"related-researchers":[{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"display_name":"Jianfeng Gao","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\" aria-label=\"Visit the profile page for Jianfeng Gao\">Jianfeng Gao<\/a>","is_active":false,"last_first":"Gao, Jianfeng","people_section":0,"alias":"jfgao"},{"type":"user_nicename","value":"Ahmed H. 
Awadallah","user_id":31979,"display_name":"Ahmed Awadallah","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/hassanam\/\" aria-label=\"Visit the profile page for Ahmed Awadallah\">Ahmed Awadallah<\/a>","is_active":false,"last_first":"Awadallah, Ahmed","people_section":0,"alias":"hassanam"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-960x540.jpg\" class=\"img-object-cover\" alt=\"Clues diagram\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-scaled-640x360.jpg 
640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Clues_still_No_logo-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\" title=\"Go to researcher profile for Jianfeng Gao\" aria-label=\"Go to researcher profile for Jianfeng Gao\" data-bi-type=\"byline author\" data-bi-cN=\"Jianfeng Gao\">Jianfeng Gao<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/hassanam\/\" title=\"Go to researcher profile for Ahmed Awadallah\" aria-label=\"Go to researcher profile for Ahmed Awadallah\" data-bi-type=\"byline author\" data-bi-cN=\"Ahmed Awadallah\">Ahmed Awadallah<\/a>","formattedDate":"December 6, 2021","formattedExcerpt":"Recent progress in natural language understanding (NLU) has been driven in part by the availability of large-scale benchmarks that provide an environment for researchers to test and measure the performance of AI models. 
Most of these benchmarks are designed for academic settings--typically datasets that feature&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/800533","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/40735"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=800533"}],"version-history":[{"count":10,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/800533\/revisions"}],"predecessor-version":[{"id":805453,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/800533\/revisions\/805453"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/800545"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=800533"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=800533"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=800533"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=800533"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=800533"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=800533"},{"taxonomy":"msr-locale","
embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=800533"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=800533"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=800533"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=800533"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=800533"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}