{"id":851413,"date":"2022-06-13T09:24:26","date_gmt":"2022-06-13T16:24:26","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-event&#038;p=851413"},"modified":"2022-11-10T22:41:33","modified_gmt":"2022-11-11T06:41:33","slug":"sumeval-2022","status":"publish","type":"msr-event","link":"https:\/\/www.microsoft.com\/en-us\/research\/event\/sumeval-2022\/","title":{"rendered":"SUMEval 2022"},"content":{"rendered":"\n\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<p>Massively Multilingual Language Models (MMLMs) are trained on around 100 languages of the world; however, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages. The languages present in evaluation benchmarks are usually high-resource and largely belong to the Indo-European language family. This makes current multilingual evaluation unreliable and does not provide a full picture of the performance of MMLMs across the linguistic landscape. Although efforts are being made to create benchmarks that cover a larger variety of tasks, languages, and language families, it is unlikely that we will be able to build benchmarks covering all languages and tasks. Due to this, there is recent interest in alternate strategies for evaluating MMLMs, including performance prediction and Machine Translation of test data. We believe that this is an important yet relatively unexplored area of research that has the potential to make language technologies <strong>accessible to all<\/strong>. The SUMEval workshop will accept submissions on alternate techniques for scaling up multilingual evaluation. 
In addition, the workshop will also include a shared task on performance prediction.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column has-light-blue-background-color has-background is-layout-flow wp-block-column-is-layout-flow\" style=\"padding-top:12px;padding-right:12px;padding-bottom:12px;padding-left:12px;flex-basis:33.33%\">\n<h4 id=\"timeline\">Timeline<\/h4>\n\n\n\n<p><em>Dates subject to change and will be updated here as needed.<\/em><\/p>\n\n\n\n<p><strong>June 28, 2022<\/strong>: Challenge data released<br><strong>August 1, 2022<\/strong>: Challenge evaluation begins<br><strong>August 10, 2022<\/strong>: Challenge ends<br><s> <strong>August 25, 2022<\/strong><\/s> <strong>September 23 2022<\/strong>: Workshop paper submission deadline<br> <strong>October 7, 2022<\/strong>: Notification of Acceptance<br> <strong>October 24, 2022<\/strong>: Camera-ready papers due<br> <strong>November 20 2022<\/strong>: SUMEval 2022 Workshop<\/p>\n\n\n\n<p><\/p>\n<\/div>\n<\/div>\n\n\n\n<p>Topics of interest include but are not restricted to:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Studies on scaling up multilingual evaluation<\/li><li>Human evaluation of multilingual models<\/li><li>Automated evaluation metrics for multilingual evaluation<\/li><li>Studies on fairness and other aspects of evaluation<\/li><li>Data sets, benchmarks or libraries for evaluating multi-lingual models<\/li><li>Probing and analysis of multilingual models<\/li><\/ul>\n\n\n\n<p><strong>Organizers:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/kabirahuja2431.github.io\/\">Kabir Ahuja<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, Microsoft<\/li><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" 
href=\"https:\/\/cs.gmu.edu\/~antonis\/author\/antonios-anastasopoulos\/\">Antonios Anastasopoulos,<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> George Mason University<\/li><li>Vishrav Chaudhary, Microsoft<\/li><li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/monojitc\/\">Monojit Choudhury<\/a>, Microsoft<\/li><li>Sandipan Dandapat, Microsoft<\/li><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.phontron.com\/\">Graham Neubig<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, Carnegie Mellon University<\/li><li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/bapatra\/\">Barun Patra<\/a>, Microsoft<\/li><li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/susitara\/\">Sunayana Sitaram<\/a>, Microsoft<\/li><\/ul>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link\" href=\"mailto:sumeval@microsoft.com\">Contact us<\/a><\/div>\n<\/div>\n\n\n\n\n\n<p>The task of performance prediction is to accurately predict the performance of a model on a set of target languages. These languages may be present in the fine-tuning data (few-shot training) or may not be present (zero-shot training). The languages used for fine-tuning are referred to as pivots, while the languages that we would like to evaluate the model on are targets. 
This&nbsp;shared task&nbsp;will consist of building a machine learning model that can accurately predict the performance of a multilingual model on languages and tasks that we do not have test data for, given accuracies of models on various combinations of pivot and target pairs.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"269\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_pp_setup-1024x269.png\" alt=\"diagram\" class=\"wp-image-852741\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_pp_setup-1024x269.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_pp_setup-300x79.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_pp_setup-768x202.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_pp_setup-1536x403.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_pp_setup-2048x537.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_pp_setup-240x63.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Challenge details:<\/strong><\/p>\n\n\n\n<p>We will release a dataset containing evaluation scores of multiple MMLMs on different tasks and languages. These scores can be used to train models that can predict how MMLMs trained on different pivot configurations will perform on target languages. 
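To make the setup concrete, here is a deliberately simple, hypothetical predictor sketched in Python. It is our own illustration, not the released baseline or evaluation code: each training record is a (pivot configuration, target language, observed score) triple, and the score for an unseen configuration is estimated by similarity-weighted averaging; the function names, the overlap heuristic, and the toy numbers are all illustrative assumptions.

```python
# Illustrative sketch only: a toy predictor for the shared task setup.
# The similarity heuristic and the numbers below are our own assumptions,
# not the official SUMEval baseline.

def config_similarity(cfg_a, cfg_b):
    """Crude overlap between two pivot configurations {language: data size}."""
    shared = set(cfg_a) & set(cfg_b)
    if not shared:
        return 0.0
    overlap = sum(min(cfg_a[lang], cfg_b[lang]) for lang in shared)
    total = sum(cfg_a.values()) + sum(cfg_b.values())
    return 2.0 * overlap / total  # 1.0 for identical configs, 0.0 for disjoint

def predict_score(train_records, query_cfg, target_lang):
    """Similarity-weighted average of observed scores for target_lang.

    train_records: list of (pivot_cfg, target_lang, observed_score) triples.
    Returns None when no usable training record covers target_lang.
    """
    num = den = 0.0
    for cfg, lang, score in train_records:
        if lang != target_lang:
            continue
        w = config_similarity(cfg, query_cfg)
        num += w * score
        den += w
    return num / den if den else None

# Toy records mirroring the example tables on this page (Telugu scores only).
train = [
    ({"en": 1000, "sw": 1000, "ko": 1600, "ar": 14000}, "te", 0.81),
    ({"en": 10000, "bn": 2000, "te": 2000, "ru": 1000}, "te", 0.84),
]
query = {"en": 5000, "ar": 2000, "fi": 1000, "ko": 3000}
print(predict_score(train, query, "te"))  # falls between the two observed te scores
```

Real submissions would presumably replace this heuristic with learned regressors over richer features (e.g. typological properties and pivot data sizes), in the spirit of the performance-prediction papers referenced on this page.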
For example, our training dataset may be as follows:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>MMLM<\/strong><\/td><td><strong>Task<\/strong><\/td><td><strong>Training Pivot configuration (language, data size)<\/strong><\/td><td><strong>Evaluation result (language, score)<\/strong><\/td><\/tr><tr><td>mBERT<\/td><td>TyDiQA<\/td><td>(en, 1000), (sw, 1000), (ko, 1600), (ar, 14000) <\/td><td>(te, 0.81), (ko, 0.68), (id, 0.85), (en, 0.77)<\/td><\/tr><tr><td>XLMR<\/td><td>TyDiQA<\/td><td>(en, 10000), (bn, 2000), (te, 2000), (ru, 1000)<\/td><td>(te, 0.84), (ko, 0.69), (id, 0.87), (en, 0.76)<\/td><\/tr><\/tbody><\/table><figcaption>Example training data released for the shared task<\/figcaption><\/figure>\n\n\n\n<p>The task is now to predict the model&#8217;s performance, given the following training configuration and test languages:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>MMLM<\/strong><\/td><td><strong>Task<\/strong><\/td><td><strong>Training Pivot configuration (language, data size)<\/strong><\/td><td><strong>Predicted result (language, score)<\/strong><\/td><\/tr><tr><td>XLMR<\/td><td>TyDiQA<\/td><td>(en, 5000), (ar, 2000), (fi, 1000), (ko, 3000)<\/td><td>(te, ??), (ko, ??), (id, ??), (en, ??), (zh, ??), (ja, ??)<\/td><\/tr><\/tbody><\/table><figcaption>Example test instance<\/figcaption><\/figure>\n\n\n\n<p>Predictions will need to be made on test languages included in the training data, as well as surprise languages. For more details on the task formulation, please refer to the papers at the bottom of this page. <\/p>\n\n\n\n<p><strong>Dataset: <\/strong>The training dataset for the challenge can be found <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/1drv.ms\/u\/s!AvNigI3ur_6Fgd80AdtZl1SxtDNs_g?e=ZIDn1e\" target=\"_blank\" rel=\"noopener noreferrer\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
Please refer to the Readme in the folder for details. <\/p>\n\n\n\n<p><strong>Evaluation procedure and Baseline numbers:<\/strong> Evaluation will be done in two conditions: Leave One Configuration Out (LOCO) and Surprise Languages.<\/p>\n\n\n\n<p><strong>Evaluation instructions<\/strong>: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/Litmus\/tree\/main\/SumEval\">Litmus\/SumEval at main \u00b7 microsoft\/Litmus \u00b7 GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p><strong>Challenge Timeline (tentative):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>June 28 2022: Dataset release<\/li><li>July 1 2022: Baseline numbers release<\/li><li><strong>August 1 2022: Test set release (non-surprise languages) and leaderboard opens<\/strong>. <\/li><li><strong>August 8 2022: Surprise languages test set released.<\/strong><\/li><li><strong>August 15 2022: Challenge ends<\/strong><\/li><li>August 25 2022: Paper submission deadline<\/li><\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>References:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Xia, Mengzhou, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, and Graham Neubig. &#8220;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/aclanthology.lst.uni-saarland.de\/2020.acl-main.764.pdf\">Predicting Performance for Natural Language Processing Tasks.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&#8221; In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8625-8646. 2020.<\/li><li>Srinivasan, Anirudh, Gauri Kholkar, Rahul Kejriwal, Tanuja Ganu, Sandipan Dandapat, Sunayana Sitaram, Balakrishnan Santhanam, Somak Aditya, Kalika Bali, and Monojit Choudhury. 
&#8220;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.aaai.org\/AAAI22Papers\/DEMO-00144-SrinivasanA.pdf\">Litmus predictor: An AI assistant for building reliable, high-performing and fair multilingual NLP systems.&#8221;<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> In&nbsp;<em>Thirty-sixth AAAI Conference on Artificial Intelligence. AAAI. System Demonstration<\/em>. 2022.<\/li><li>Ye, Zihuiwen, Pengfei Liu, Jinlan Fu, and Graham Neubig. &#8220;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aclanthology.org\/2021.eacl-main.324.pdf\">Towards More Fine-grained and Reliable NLP Performance Prediction.&#8221;<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 3703-3714. 2021.<\/li><li>Ahuja, Kabir, Shanu Kumar, Sandipan Dandapat, and Monojit Choudhury. &#8220;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aclanthology.org\/2022.acl-long.374.pdf\">Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&#8221; In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5454-5467. 2022.<\/li><\/ul>\n\n\n\n\n\n<ul class=\"wp-block-list\"><li>Paper submission deadline: <strong><s>Aug 25, 2022<\/s><\/strong> <strong>September 23 2022. In addition to regular papers submitted to the workshop, we will also accept papers that were submitted elsewhere and papers with AACL and ARR reviews. 
Papers that have been submitted elsewhere will not be included in proceedings, but participants will get a chance to present them during the workshop.<\/strong> Please get in touch if you have any questions.<\/li><li>The paper submission will be done via OpenReview. Submission link: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/group?id=aclweb.org\/AACL-IJCNLP\/2022\/Workshop\/SUMEval\">https:\/\/openreview.net\/group?id=aclweb.org\/AACL-IJCNLP\/2022\/Workshop\/SUMEval<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li><li>Submitted manuscripts must be 8 pages long for full papers, and 4 pages long for short papers. Both full and short papers can have unlimited pages for references and appendices. We follow ARR submission guidelines. For more information about templates, guidelines, and instructions, see the&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.google.com\/url?q=https%3A%2F%2Faclrollingreview.org%2Fcfp&sa=D&sntz=1&usg=AOvVaw1dfgoYWeGrsJJrd7KEpHcr\" target=\"_blank\" rel=\"noopener noreferrer\">ARR CFP guidelines<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
We encourage authors to include a broader impact and ethical concerns statement, following&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.google.com\/url?q=https%3A%2F%2Faclrollingreview.org%2Fcfp&sa=D&sntz=1&usg=AOvVaw1dfgoYWeGrsJJrd7KEpHcr\" target=\"_blank\" rel=\"noopener noreferrer\">ARR Ethics Policy<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;from the main conference.<\/li><li>All submissions will be double-blind peer-reviewed (with author names and affiliations removed) by the program committee and judged by their relevance to the workshop themes.<\/li><li>Please note that at least one of the authors of each accepted paper must register for the workshop and present the paper.<\/li><\/ul>\n\n\n\n\n\n<p>SUMEval2022 will be held (starting) on 20th November 2022 and will be merged with the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/eval4nlp.github.io\/2022\/index.html\">Eval4NLP workshop<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> this year. 
Please find the combined schedule in Taipei time below.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Session<\/strong><\/td><td><strong>Time (Taipei time, UTC+8)<\/strong><\/td><td><strong>Paper ID<\/strong><\/td><td><strong>Paper Title<\/strong><\/td><td><strong>Authors<\/strong><\/td><\/tr><tr><td>1<\/td><td>18:30:00 &#8211; 18:45<\/td><td><\/td><td>Opening Presentation<\/td><td><\/td><\/tr><tr><td>1<\/td><td>18:45:00 &#8211; 19:30<\/td><td><\/td><td>Invited Talk<\/td><td>Anya Belz<\/td><\/tr><tr><td>1<\/td><td>19:30:00 &#8211; 19:45<\/td><td><\/td><td>Why sentence similarity benchmark is not predictive of application-oriented task performance?<\/td><td>Kaori Abe, Sho Yokoi, Tomoyuki Kajiwara, Kentaro Inui<\/td><\/tr><tr><td>1<\/td><td>19:45:00 &#8211; 20:00<\/td><td><\/td><td>Better Smatch = Better Parser? AMR evaluation is not so simple anymore<\/td><td>Juri Opitz, Anette Frank<\/td><\/tr><tr><td>1<\/td><td>20:00:00 &#8211; 20:15<\/td><td><\/td><td>Chat Translation Error Detection for Assisting Cross-lingual Communications<\/td><td>Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Ryoko Tokuhisa, Ana Brassard, Kentaro Inui<\/td><\/tr><tr><td>2<\/td><td>21:15:00 &#8211; 22:00<\/td><td><\/td><td>Invited Talk<\/td><td>Reinald Kim Amplayo<\/td><\/tr><tr><td>2<\/td><td>22:00:00 &#8211; 22:15<\/td><td><\/td><td>A Japanese Corpus of Many Specialized Domains for Word Segmentation and Part-of-Speech Tagging<\/td><td>Shohei Higashiyama, Masao Ideuchi, Masao Utiyama, Yoshiaki Oida, Eiichiro Sumita<\/td><\/tr><tr><td>2<\/td><td>22:15:00 &#8211; 22:30<\/td><td><\/td><td>Evaluating the role of non-lexical markers in GPT-2\u2019s language modeling behavior<\/td><td>Roberta Rocca<\/td><\/tr><tr><td>2<\/td><td>22:30:00 &#8211; 22:45<\/td><td><\/td><td>From COMET to COMES \u2013 Can Summary Evaluation Benefit from Translation Evaluation?<\/td><td>Mateusz Krubi\u0144ski, Pavel Pecina<\/td><\/tr><tr><td>2<\/td><td>22:45:00 &#8211; 
23:00<\/td><td><\/td><td>Random Text Perturbations Work, but not Always<\/td><td>Zhengxiang Wang<\/td><\/tr><tr><td>3<\/td><td>23:30:00 &#8211; 00:15<\/td><td><\/td><td>Invited Talk<\/td><td>Maxime Peyrard<\/td><\/tr><tr><td>3<\/td><td>00:15:00 &#8211; 00:30<\/td><td><\/td><td>A Comparative Analysis of Stance Detection Approaches and Datasets<\/td><td>Parush Gera, Tempestt Neal<\/td><\/tr><tr><td>3<\/td><td>00:30:00 &#8211; 00:45<\/td><td><\/td><td>Assessing Neural Referential Form Selectors on a Realistic Multilingual Dataset<\/td><td>Guanyi Chen, Fahime Same, Kees Van Deemter<\/td><\/tr><tr><td>3<\/td><td>00:45:00 &#8211; 01:00<\/td><td><\/td><td>Assessing Resource-Performance Trade-off of Natural Language Models using Data Envelopment Analysis<\/td><td>Zachary Zhou, Alisha Zachariah, Devin Conathan, Jeffery Kline<\/td><\/tr><tr><td>3<\/td><td>01:00:00 &#8211; 01:15<\/td><td><\/td><td>GLARE: Generative Left-to-right AdversaRial Examples<\/td><td>Ryan Andrew Chi, Nathan Kim, Patrick Liu, Zander Lack, Ethan A Chi<\/td><\/tr><tr><td>4<\/td><td>01:30-02:10<\/td><td><\/td><td>Invited Talk Masakhane<\/td><td>Masakhane<\/td><\/tr><tr><td>4<\/td><td>02:10-02:20<\/td><td><\/td><td>SUMEval shared task overview<\/td><td>Kabir Ahuja, Antonios Anastasopoulos, Barun Patra, Graham Neubig, Monojit Choudhury, Sandipan Dandapat, Sunayana Sitaram and Vishrav Chaudhary<\/td><\/tr><tr><td>4<\/td><td>02:20-02:30<\/td><td>SUMEval-2<\/td><td>To Train or Not to Train: Predicting the Performance of Massively Multilingual Models<\/td><td>Shantanu Patankar, Omkar Gokhale, Onkar Litake, Aditya Mandke and Dipali Kadam<\/td><\/tr><tr><td>4<\/td><td>02:30-02:40<\/td><td>SUMEval-1<\/td><td>The GMU System Submission for the SUMEval 2022 Shared Task<\/td><td>Syeda Sabrina Akter and Antonios Anastasopoulos<\/td><\/tr><tr><td>4<\/td><td>02:40-02:55<\/td><td>SUMEval-5<\/td><td>NTREX-128 &#8212; News Test References for MT Evaluation of 128 Languages<\/td><td>Christian Federmann, Tom Kocmi, Ying 
Xin<\/td><\/tr><tr><td>4<\/td><td>02:55-03:10<\/td><td>SUMEval-6<\/td><td>IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages<\/td><td>Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Pascale Fung, Ayu Purwariant<\/td><\/tr><tr><td>4<\/td><td>03:10-03:20<\/td><td><\/td><td>Closing<\/td><td><\/td><\/tr><\/tbody><\/table><figcaption>SUMEval 2022 Workshop Schedule on 20 November 2022<\/figcaption><\/figure>\n\n\n","protected":false},"excerpt":{"rendered":"<p>Massively Multilingual Language Models (MMLMs) are trained on around 100 languages of the world, however, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages. The languages present in evaluation benchmarks are usually high-resource and largely belong to the Indo-European language family. This makes current multilingual evaluation unreliable and does [&hellip;]<\/p>\n","protected":false},"featured_media":851050,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr_startdate":"2022-11-20","msr_enddate":"","msr_location":"Online only","msr_expirationdate":"","msr_event_recording_link":"","msr_event_link":"","msr_event_link_redirect":false,"msr_event_time":"","msr_hide_region":false,"msr_private_event":false,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[13556],"msr-region":[197903,256048],"msr-event-type":[210063],"msr-video-type":[],"msr-locale":[268875],"msr-program-audience":[],"msr-post-option":[],"msr-impact-theme":[],"class_list":["post-851413","msr-event","type-msr-event","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-region-asia-pacific","msr-region-global","msr-event-type-workshop","msr-locale-en_us"],"msr_about":"<!-- wp:msr\/event-details {\"title\":\"SUMEval 2022\",\"hasSubtitle\":true,\"subTitle\":\"Scaling Up 
Multilingual Evaluation Workshop @\u00a0\\u003ca href=\\u0022http:\/\/www.aacl2022.org\/\\u0022\\u003eAACL 2022\\u003c\/a\\u003e\",\"image\":{\"id\":851050,\"url\":\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_logo_banner-62a1a0e27fb35.png\",\"alt\":\"SumEval logo\"}} \/-->\n\n<!-- wp:msr\/content-tabs -->\n<!-- wp:msr\/content-tab -->\n<!-- wp:columns -->\n<div class=\"wp-block-columns\"><!-- wp:column {\"width\":\"66.66%\"} -->\n<div class=\"wp-block-column\" style=\"flex-basis:66.66%\"><!-- wp:paragraph {\"placeholder\":\"Add Event Overview content\u2026\"} -->\n<p>Massively Multilingual Language Models (MMLMs) are trained on around 100 languages of the world; however, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages. The languages present in evaluation benchmarks are usually high-resource and largely belong to the Indo-European language family. This makes current multilingual evaluation unreliable and does not provide a full picture of the performance of MMLMs across the linguistic landscape. Although efforts are being made to create benchmarks that cover a larger variety of tasks, languages, and language families, it is unlikely that we will be able to build benchmarks covering all languages and tasks. Due to this, there is recent interest in alternate strategies for evaluating MMLMs, including performance prediction and Machine Translation of test data. We believe that this is an important yet relatively unexplored area of research that has the potential to make language technologies <strong>accessible to all<\/strong>. The SUMEval workshop will accept submissions on alternate techniques for scaling up multilingual evaluation. 
In addition, the workshop will also include a shared task on performance prediction.<\/p>\n<!-- \/wp:paragraph --><\/div>\n<!-- \/wp:column -->\n\n<!-- wp:column {\"width\":\"33.33%\",\"style\":{\"spacing\":{\"padding\":{\"top\":\"12px\",\"right\":\"12px\",\"bottom\":\"12px\",\"left\":\"12px\"}}},\"backgroundColor\":\"light-blue\"} -->\n<div class=\"wp-block-column has-light-blue-background-color has-background\" style=\"padding-top:12px;padding-right:12px;padding-bottom:12px;padding-left:12px;flex-basis:33.33%\"><!-- wp:heading {\"level\":4} -->\n<h4>Timeline<\/h4>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><em>Dates subject to change and will be updated here as needed.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>June 28, 2022<\/strong>: Challenge data released<br><strong>August 1, 2022<\/strong>: Challenge evaluation begins<br><strong>August 10, 2022<\/strong>: Challenge ends<br><s> <strong>August 25, 2022<\/strong><\/s> <strong>September 23 2022<\/strong>: Workshop paper submission deadline<br> <strong>October 7, 2022<\/strong>: Notification of Acceptance<br> <strong>October 24, 2022<\/strong>: Camera-ready papers due<br> <strong>November 20 2022<\/strong>: SUMEval 2022 Workshop<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph --><\/div>\n<!-- \/wp:column --><\/div>\n<!-- \/wp:columns -->\n\n<!-- wp:paragraph -->\n<p>Topics of interest include but are not restricted to:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul><li>Studies on scaling up multilingual evaluation<\/li><li>Human evaluation of multilingual models<\/li><li>Automated evaluation metrics for multilingual evaluation<\/li><li>Studies on fairness and other aspects of evaluation<\/li><li>Data sets, benchmarks or libraries for evaluating multi-lingual models<\/li><li>Probing and analysis of multilingual models<\/li><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><strong>Organizers:<\/strong><\/p>\n<!-- 
\/wp:paragraph -->\n\n<!-- wp:list -->\n<ul><li><a href=\"https:\/\/kabirahuja2431.github.io\/\">Kabir Ahuja<\/a>, Microsoft<\/li><li><a href=\"https:\/\/cs.gmu.edu\/~antonis\/author\/antonios-anastasopoulos\/\">Antonios Anastasopoulos,<\/a> George Mason University<\/li><li>Vishrav Chaudhary, Microsoft<\/li><li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/monojitc\/\">Monojit Choudhury<\/a>, Microsoft<\/li><li>Sandipan Dandapat, Microsoft<\/li><li><a href=\"http:\/\/www.phontron.com\/\">Graham Neubig<\/a>, Carnegie Mellon University<\/li><li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/bapatra\/\">Barun Patra<\/a>, Microsoft<\/li><li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/susitara\/\">Sunayana Sitaram<\/a>, Microsoft<\/li><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:buttons -->\n<div class=\"wp-block-buttons\"><!-- wp:button -->\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link\" href=\"mailto:sumeval@microsoft.com\">Contact us<\/a><\/div>\n<!-- \/wp:button --><\/div>\n<!-- \/wp:buttons -->\n<!-- \/wp:msr\/content-tab -->\n\n<!-- wp:msr\/content-tab {\"title\":\"Shared task\"} -->\n<!-- wp:paragraph {\"placeholder\":\"Write content\u2026\"} -->\n<p>The task of performance prediction is to accurately predict the performance of a model on a set of target languages. These languages may be present in the fine-tuning data (few-shot training) or may not be present (zero-shot training). The languages used for fine-tuning are referred to as pivots, while the languages that we would like to evaluate the model on are targets. 
This&nbsp;shared task&nbsp;will consist of building a machine learning model that can accurately predict the performance of a multilingual model on languages and tasks that we do not have test data for, given accuracies of models on various combinations of pivot and target pairs.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"id\":852741,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} -->\n<figure class=\"wp-block-image size-large\"><img src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_pp_setup-1024x269.png\" alt=\"diagram\" class=\"wp-image-852741\" \/><\/figure>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p><strong>Challenge details:<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>We will release a dataset containing evaluation scores of multiple MMLMs on different tasks and languages. These scores can be used to train models that can predict how MMLMs trained on different pivot configurations will perform on target languages. 
For example, our training dataset may be as follows:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table -->\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>MMLM<\/strong><\/td><td><strong>Task<\/strong><\/td><td><strong>Training Pivot configuration (language, data size)<\/strong><\/td><td><strong>Evaluation result (language, score)<\/strong><\/td><\/tr><tr><td>mBERT<\/td><td>TyDiQA<\/td><td>(en, 1000), (sw, 1000), (ko, 1600), (ar, 14000) <\/td><td>(te, 0.81), (ko, 0.68), (id, 0.85), (en, 0.77)<\/td><\/tr><tr><td>XLMR<\/td><td>TyDiQA<\/td><td>(en, 10000), (bn, 2000), (te, 2000), (ru, 1000)<\/td><td>(te, 0.84), (ko, 0.69), (id, 0.87), (en, 0.76)<\/td><\/tr><\/tbody><\/table><figcaption>Example training data released for the shared task<\/figcaption><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>The task is now to predict the model's performance, given the following training configuration and test languages:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table -->\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>MMLM<\/strong><\/td><td><strong>Task<\/strong><\/td><td><strong>Training Pivot configuration (language, data size)<\/strong><\/td><td><strong>Predicted result (language, score)<\/strong><\/td><\/tr><tr><td>XLMR<\/td><td>TyDiQA<\/td><td>(en, 5000), (ar, 2000), (fi, 1000), (ko, 3000)<\/td><td>(te, ??), (ko, ??), (id, ??), (en, ??), (zh, ??), (ja, ??)<\/td><\/tr><\/tbody><\/table><figcaption>Example test instance<\/figcaption><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Predictions will need to be made on test languages included in the training data, as well as surprise languages. For more details on the task formulation, please refer to the papers at the bottom of this page. 
<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Dataset: <\/strong>The training dataset for the challenge can be found <a href=\"https:\/\/1drv.ms\/u\/s!AvNigI3ur_6Fgd80AdtZl1SxtDNs_g?e=ZIDn1e\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. Please refer to the Readme in the folder for details. <\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Evaluation procedure and Baseline numbers:<\/strong> Evaluation will be done in two conditions: Leave One Configuration Out (LOCO) and Surprise Languages.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Evaluation instructions<\/strong>: <a href=\"https:\/\/github.com\/microsoft\/Litmus\/tree\/main\/SumEval\">Litmus\/SumEval at main \u00b7 microsoft\/Litmus \u00b7 GitHub<\/a><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Challenge Timeline (tentative):<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul><li>June 28 2022: Dataset release<\/li><li>July 1 2022: Baseline numbers release<\/li><li><strong>August 1 2022: Test set release (non-surprise languages) and leaderboard opens<\/strong>. <\/li><li><strong>August 8 2022: Surprise languages test set released.<\/strong><\/li><li><strong>August 15 2022: Challenge ends<\/strong><\/li><li>August 25 2022: Paper submission deadline<\/li><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>References:<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul><li>Xia, Mengzhou, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, and Graham Neubig. \"<a href=\"http:\/\/aclanthology.lst.uni-saarland.de\/2020.acl-main.764.pdf\">Predicting Performance for Natural Language Processing Tasks.<\/a>\" In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8625-8646. 
2020.<\/li><li>Srinivasan, Anirudh, Gauri Kholkar, Rahul Kejriwal, Tanuja Ganu, Sandipan Dandapat, Sunayana Sitaram, Balakrishnan Santhanam, Somak Aditya, Kalika Bali, and Monojit Choudhury. \"<a href=\"https:\/\/www.aaai.org\/AAAI22Papers\/DEMO-00144-SrinivasanA.pdf\">Litmus predictor: An AI assistant for building reliable, high-performing and fair multilingual NLP systems.\"<\/a> In&nbsp;<em>Thirty-sixth AAAI Conference on Artificial Intelligence. AAAI. System Demonstration<\/em>. 2022.<\/li><li>Ye, Zihuiwen, Pengfei Liu, Jinlan Fu, and Graham Neubig. \"<a href=\"https:\/\/aclanthology.org\/2021.eacl-main.324.pdf\">Towards More Fine-grained and Reliable NLP Performance Prediction.\"<\/a> In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 3703-3714. 2021.<\/li><li>Ahuja, Kabir, Shanu Kumar, Sandipan Dandapat, and Monojit Choudhury. \"<a href=\"https:\/\/aclanthology.org\/2022.acl-long.374.pdf\">Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models.<\/a>\" In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5454-5467. 2022.<\/li><\/ul>\n<!-- \/wp:list -->\n<!-- \/wp:msr\/content-tab -->\n\n<!-- wp:msr\/content-tab {\"title\":\"Paper submission\"} -->\n<!-- wp:list -->\n<ul><li>Paper submission deadline: <strong><s>Aug 25, 2022<\/s><\/strong> <strong>September 23 2022. In addition to regular papers submitted to the workshop, we will also accept papers that were submitted elsewhere and papers with AACL and ARR reviews. Papers that have been submitted elsewhere will not be included in proceedings, but participants will get a chance to present them during the workshop.<\/strong> Please get in touch if you have any questions.<\/li><li>The paper submission will be done via OpenReview. 
Submission link: <a href=\"https:\/\/openreview.net\/group?id=aclweb.org\/AACL-IJCNLP\/2022\/Workshop\/SUMEval\">https:\/\/openreview.net\/group?id=aclweb.org\/AACL-IJCNLP\/2022\/Workshop\/SUMEval<\/a><\/li><li>Submitted manuscripts may be up to 8 pages long for full papers and up to 4 pages long for short papers. Both full and short papers can have unlimited pages for references and appendices. We follow ARR submission guidelines. For more information about templates, guidelines, and instructions, see the&nbsp;<a href=\"https:\/\/aclrollingreview.org\/cfp\" target=\"_blank\" rel=\"noreferrer noopener\">ARR CFP guidelines<\/a>. We encourage authors to include a broader impact and ethical concerns statement, following the&nbsp;<a href=\"https:\/\/aclrollingreview.org\/cfp\" target=\"_blank\" rel=\"noreferrer noopener\">ARR Ethics Policy<\/a>&nbsp;from the main conference.<\/li><li>All submissions will be double-blind peer-reviewed (with author names and affiliations removed) by the program committee and judged on their relevance to the workshop themes.<\/li><li>Please note that at least one author of each accepted paper must register for the workshop and present the paper.<\/li><\/ul>\n<!-- \/wp:list -->\n<!-- \/wp:msr\/content-tab -->\n\n<!-- wp:msr\/content-tab {\"title\":\"Schedule\"} -->\n<!-- wp:paragraph -->\n<p>SUMEval 2022 will start on 20 November 2022 and is merged with the <a href=\"https:\/\/eval4nlp.github.io\/2022\/index.html\">Eval4NLP workshop<\/a> this year. 
Please find the combined schedule in Taipei time below.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table -->\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Session<\/strong><\/td><td><strong>Time (Taipei time, UTC+8)<\/strong><\/td><td><strong>Paper ID<\/strong><\/td><td><strong>Paper Title<\/strong><\/td><td><strong>Authors<\/strong><\/td><\/tr><tr><td>1<\/td><td>18:30 - 18:45<\/td><td><\/td><td>Opening Presentation<\/td><td><\/td><\/tr><tr><td>1<\/td><td>18:45 - 19:30<\/td><td><\/td><td>Invited Talk<\/td><td>Anya Belz<\/td><\/tr><tr><td>1<\/td><td>19:30 - 19:45<\/td><td><\/td><td>Why sentence similarity benchmark is not predictive of application-oriented task performance?<\/td><td>Kaori Abe, Sho Yokoi, Tomoyuki Kajiwara, Kentaro Inui<\/td><\/tr><tr><td>1<\/td><td>19:45 - 20:00<\/td><td><\/td><td>Better Smatch = Better Parser? AMR evaluation is not so simple anymore<\/td><td>Juri Opitz, Anette Frank<\/td><\/tr><tr><td>1<\/td><td>20:00 - 20:15<\/td><td><\/td><td>Chat Translation Error Detection for Assisting Cross-lingual Communications<\/td><td>Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Ryoko Tokuhisa, Ana Brassard, Kentaro Inui<\/td><\/tr><tr><td>2<\/td><td>21:15 - 22:00<\/td><td><\/td><td>Invited Talk<\/td><td>Reinald Kim Amplayo<\/td><\/tr><tr><td>2<\/td><td>22:00 - 22:15<\/td><td><\/td><td>A Japanese Corpus of Many Specialized Domains for Word Segmentation and Part-of-Speech Tagging<\/td><td>Shohei Higashiyama, Masao Ideuchi, Masao Utiyama, Yoshiaki Oida, Eiichiro Sumita<\/td><\/tr><tr><td>2<\/td><td>22:15 - 22:30<\/td><td><\/td><td>Evaluating the role of non-lexical markers in GPT-2\u2019s language modeling behavior<\/td><td>Roberta Rocca<\/td><\/tr><tr><td>2<\/td><td>22:30 - 22:45<\/td><td><\/td><td>From COMET to COMES \u2013 Can Summary Evaluation Benefit from Translation Evaluation?<\/td><td>Mateusz Krubi\u0144ski, Pavel Pecina<\/td><\/tr><tr><td>2<\/td><td>22:45 - 23:00<\/td><td><\/td><td>Random Text Perturbations Work, but not Always<\/td><td>Zhengxiang Wang<\/td><\/tr><tr><td>3<\/td><td>23:30 - 00:15<\/td><td><\/td><td>Invited Talk<\/td><td>Maxime Peyrard<\/td><\/tr><tr><td>3<\/td><td>00:15 - 00:30<\/td><td><\/td><td>A Comparative Analysis of Stance Detection Approaches and Datasets<\/td><td>Parush Gera, Tempestt Neal<\/td><\/tr><tr><td>3<\/td><td>00:30 - 00:45<\/td><td><\/td><td>Assessing Neural Referential Form Selectors on a Realistic Multilingual Dataset<\/td><td>Guanyi Chen, Fahime Same, Kees Van Deemter<\/td><\/tr><tr><td>3<\/td><td>00:45 - 01:00<\/td><td><\/td><td>Assessing Resource-Performance Trade-off of Natural Language Models using Data Envelopment Analysis<\/td><td>Zachary Zhou, Alisha Zachariah, Devin Conathan, Jeffery Kline<\/td><\/tr><tr><td>3<\/td><td>01:00 - 01:15<\/td><td><\/td><td>GLARE: Generative Left-to-right AdversaRial Examples<\/td><td>Ryan Andrew Chi, Nathan Kim, Patrick Liu, Zander Lack, Ethan A Chi<\/td><\/tr><tr><td>4<\/td><td>01:30 - 02:10<\/td><td><\/td><td>Invited Talk<\/td><td>Masakhane<\/td><\/tr><tr><td>4<\/td><td>02:10 - 02:20<\/td><td><\/td><td>SUMEval shared task overview<\/td><td>Kabir Ahuja, Antonios Anastasopoulos, Barun Patra, Graham Neubig, Monojit Choudhury, Sandipan Dandapat, Sunayana Sitaram and Vishrav Chaudhary<\/td><\/tr><tr><td>4<\/td><td>02:20 - 02:30<\/td><td>SUMEval-2<\/td><td>To Train or Not to Train: Predicting the Performance of Massively Multilingual Models<\/td><td>Shantanu Patankar, Omkar Gokhale, Onkar Litake, Aditya Mandke and Dipali Kadam<\/td><\/tr><tr><td>4<\/td><td>02:30 - 02:40<\/td><td>SUMEval-1<\/td><td>The GMU System Submission for the SUMEval 2022 Shared Task<\/td><td>Syeda Sabrina Akter and Antonios Anastasopoulos<\/td><\/tr><tr><td>4<\/td><td>02:40 - 02:55<\/td><td>SUMEval-5<\/td><td>NTREX-128 \u2013 News Test References for MT Evaluation of 128 Languages<\/td><td>Christian Federmann, Tom Kocmi, Ying 
Xin<\/td><\/tr><tr><td>4<\/td><td>02:55-03:10<\/td><td>SUMEval-6<\/td><td>IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages<\/td><td>Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Pascale Fung, Ayu Purwariant<\/td><\/tr><tr><td>4<\/td><td>03:10-03:20<\/td><td><\/td><td>Closing<\/td><td><\/td><\/tr><\/tbody><\/table><figcaption>SUMEval 2022 Workshop Schedule on 20 November 2022<\/figcaption><\/figure>\n<!-- \/wp:table -->\n<!-- \/wp:msr\/content-tab -->\n<!-- \/wp:msr\/content-tabs -->","tab-content":[],"msr_startdate":"2022-11-20","msr_enddate":"","msr_event_time":"","msr_location":"Online only","msr_event_link":"","msr_event_recording_link":"","msr_startdate_formatted":"November 20, 2022","msr_register_text":"Watch now","msr_cta_link":"","msr_cta_text":"","msr_cta_bi_name":"","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_logo_banner-62a1a0e27fb35-960x540.png\" class=\"img-object-cover\" alt=\"SumEval logo\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_logo_banner-62a1a0e27fb35-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_logo_banner-62a1a0e27fb35-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_logo_banner-62a1a0e27fb35-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_logo_banner-62a1a0e27fb35-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_logo_banner-62a1a0e27fb35-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/06\/sumeval_logo_banner-62a1a0e27fb35-1280x720.png 1280w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","event_excerpt":"Massively 
Multilingual Language Models (MMLMs) are trained on around 100 languages of the world, however, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages. The languages present in evaluation benchmarks are usually high-resource and largely belong to the Indo-European language family. This makes current multilingual evaluation unreliable and does not provide a full picture of the performance of MMLMs across the linguistic landscape. Although efforts are being made to&hellip;","msr_research_lab":[],"related-researchers":[],"msr_impact_theme":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-opportunities":[],"related-publications":[],"related-videos":[],"related-posts":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event\/851413","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-event"}],"version-history":[{"count":28,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event\/851413\/revisions"}],"predecessor-version":[{"id":897459,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event\/851413\/revisions\/897459"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/851050"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=851413"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=851413"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=851413"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.
com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=851413"},{"taxonomy":"msr-video-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video-type?post=851413"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=851413"},{"taxonomy":"msr-program-audience","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-program-audience?post=851413"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=851413"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=851413"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}