November 20, 2022

SUMEval 2022

Scaling Up Multilingual Evaluation Workshop @ AACL 2022

Location: Online only

The goal of performance prediction is to accurately predict the performance of a model on a set of target languages. These languages may be present in the fine-tuning data (the few-shot setting) or absent from it (the zero-shot setting). The languages used for fine-tuning are referred to as pivots, while the languages on which we would like to evaluate the model are targets. This shared task consists of building a machine learning model that can accurately predict the performance of a multilingual model on languages and tasks for which we do not have test data, given the accuracies of models on various combinations of pivot and target pairs.
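To make the setup concrete, the sketch below (Python, with hypothetical names such as predict_score and PivotConfig that are not part of the shared task) shows the input/output signature a performance predictor would need to provide:

```python
from typing import Dict

# A pivot configuration maps each fine-tuning language to its data size,
# e.g. {"en": 1000, "sw": 1000, "ko": 1600, "ar": 14000}.
PivotConfig = Dict[str, int]

def predict_score(mmlm: str, task: str, pivots: PivotConfig, target_lang: str) -> float:
    """Hypothetical interface: return the predicted evaluation score of `mmlm`
    fine-tuned on `pivots` for `task` and evaluated on `target_lang`."""
    raise NotImplementedError("a submission would implement this")
```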


Challenge details:

We will release a dataset containing evaluation scores of multiple MMLMs (massively multilingual language models) on different tasks and languages. These scores can be used to train models that predict how MMLMs trained on different pivot configurations will perform on target languages. For example, our training dataset may be as follows:

| MMLM | Task | Training pivot configuration (language, data size) | Evaluation result (language, score) |
|------|------|-----------------------------------------------------|--------------------------------------|
| mBERT | TyDiQA | (en, 1000), (sw, 1000), (ko, 1600), (ar, 14000) | (te, 0.81), (ko, 0.68), (id, 0.85), (en, 0.77) |
| XLMR | TyDiQA | (en, 10000), (bn, 2000), (te, 2000), (ru, 1000) | (te, 0.84), (ko, 0.69), (id, 0.87), (en, 0.76) |

Example training data released for the shared task
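One way to turn such rows into supervised training examples is to featurize each (pivot configuration, target language) pair and use the observed score as the regression target. The sketch below shows one plausible featurization under that assumption, not the official one; featurize and languages are illustrative names.

```python
def featurize(pivots, target_lang, languages):
    """One plausible featurization (not the official one): the pivot data size
    per language, plus a one-hot indicator for the target language."""
    size_features = [pivots.get(lang, 0) for lang in languages]
    target_onehot = [1 if lang == target_lang else 0 for lang in languages]
    return size_features + target_onehot

# Example: the (te, 0.81) entry of the mBERT / TyDiQA row above.
languages = ["en", "sw", "ko", "ar", "bn", "te", "ru", "fi", "id", "zh", "ja"]
x = featurize({"en": 1000, "sw": 1000, "ko": 1600, "ar": 14000}, "te", languages)
y = 0.81  # the observed score becomes the regression target
```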

The task is now to predict the model’s performance, given the following training configuration and test languages:

| MMLM | Task | Training pivot configuration (language, data size) | Predicted result (language, score) |
|------|------|-----------------------------------------------------|------------------------------------|
| XLMR | TyDiQA | (en, 5000), (ar, 2000), (fi, 1000), (ko, 3000) | (te, ??), (ko, ??), (id, ??), (en, ??), (zh, ??), (ja, ??) |

Example test instance
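Continuing the sketch above, a simple regressor can be fit on the flattened training rows and then used to fill in the unknown scores of the test instance. GradientBoostingRegressor from scikit-learn is used here only as an example choice, not as the official baseline; with only the two example rows the predictions are of course meaningless, but the shape of the pipeline is the same with the full released dataset.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Flatten both example training rows into (pivot config, target, score) triples.
# The MMLM identity (mBERT vs. XLMR) is ignored here for brevity; a real
# predictor would include it as a feature.
rows = [
    ({"en": 1000, "sw": 1000, "ko": 1600, "ar": 14000}, "te", 0.81),
    ({"en": 1000, "sw": 1000, "ko": 1600, "ar": 14000}, "ko", 0.68),
    ({"en": 1000, "sw": 1000, "ko": 1600, "ar": 14000}, "id", 0.85),
    ({"en": 1000, "sw": 1000, "ko": 1600, "ar": 14000}, "en", 0.77),
    ({"en": 10000, "bn": 2000, "te": 2000, "ru": 1000}, "te", 0.84),
    ({"en": 10000, "bn": 2000, "te": 2000, "ru": 1000}, "ko", 0.69),
    ({"en": 10000, "bn": 2000, "te": 2000, "ru": 1000}, "id", 0.87),
    ({"en": 10000, "bn": 2000, "te": 2000, "ru": 1000}, "en", 0.76),
]
X = [featurize(pivots, target, languages) for pivots, target, _ in rows]
y = [score for _, _, score in rows]

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Fill in the "??" entries of the example test instance above.
test_pivots = {"en": 5000, "ar": 2000, "fi": 1000, "ko": 3000}
for target in ["te", "ko", "id", "en", "zh", "ja"]:
    score = model.predict([featurize(test_pivots, target, languages)])[0]
    print(f"{target}: {score:.3f}")
```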

Predictions will need to be made on test languages included in the training data, as well as surprise languages. For more details on the task formulation, please refer to the papers at the bottom of this page.

Dataset: The training dataset for the challenge can be found here. Please refer to the Readme in the folder for details.

Evaluation procedure and baseline numbers: Evaluation will be carried out under two conditions: Leave One Configuration Out (LOCO) and Surprise Languages.
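For intuition, here is a minimal sketch of the Leave One Configuration Out condition, assuming the data is grouped by training (pivot) configuration and using mean absolute error purely as an illustrative metric; loco_mae, fit, and predict are hypothetical names, and the official evaluation scripts are linked below.

```python
from statistics import mean

def loco_mae(groups, fit, predict):
    """Leave One Configuration Out: hold out every row belonging to one pivot
    configuration, train on the remaining configurations, and score the
    held-out rows; report the mean absolute error averaged over configurations.

    `groups` maps a configuration id to a list of (features, score) pairs;
    `fit` and `predict` are hypothetical training/inference callables.
    """
    errors = []
    for held_out, test_rows in groups.items():
        train_rows = [row for cfg, rows in groups.items() if cfg != held_out for row in rows]
        model = fit(train_rows)
        errors.append(mean(abs(predict(model, x) - y) for x, y in test_rows))
    return mean(errors)
```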

Evaluation instructions: Litmus/SumEval at main · microsoft/Litmus · GitHub

Challenge Timeline (tentative):

  • June 28, 2022: Dataset release
  • July 1, 2022: Baseline numbers release
  • August 1, 2022: Test set release (non-surprise languages) and leaderboard opening
  • August 8, 2022: Surprise-language test set release
  • August 15, 2022: Challenge ends
  • August 25, 2022: Paper submission deadline

References: