Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions

  • Muhammad N. ElNokrashy
  • Amr Hendy
  • Mohamed Abdelghaffar
  • M. Afify
  • ,
  • H. Awadalla

ArXiv, Vol. abs/2011.07933


This paper describes our submission to the WMT20 sentence filtering task. We combine scores from three sources: a custom LASER model built for each source language, a classifier trained to distinguish positive and negative sentence pairs, and the original scores provided with the task. For the mBART setup provided by the organizers, our method shows 7% and 5% relative improvement over the baseline in sacreBLEU score on the test set for Pashto and Khmer, respectively.
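
The abstract does not state how the three scores are combined, so the following is only a minimal sketch of one plausible scheme: each score list is min-max normalized and the results are merged with a weighted sum, after which the highest-scoring pairs are kept. The function names, the equal default weights, and the normalization choice are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_scores(
    laser: Sequence[float],
    classifier: Sequence[float],
    original: Sequence[float],
    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),  # assumed, not from the paper
) -> List[float]:
    """Weighted sum of the three normalized per-pair scores (hypothetical rule)."""
    if not (len(laser) == len(classifier) == len(original)):
        raise ValueError("all score lists must have one entry per sentence pair")
    if len(weights) != 3:
        raise ValueError("expected exactly three weights")
    columns = [min_max_normalize(c) for c in (laser, classifier, original)]
    return [
        sum(w * col[i] for w, col in zip(weights, columns))
        for i in range(len(laser))
    ]


def top_k_indices(combined: Sequence[float], k: int) -> List[int]:
    """Indices of the k best-scoring sentence pairs, best first."""
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy scores for three sentence pairs (made-up values).
    laser = [0.2, 0.8, 0.5]
    classifier = [0.1, 0.9, 0.4]
    original = [1.0, 3.0, 2.0]
    combined = combine_scores(laser, classifier, original)
    print(top_k_indices(combined, k=2))
```

In practice the weights would likely be tuned on a development set, and the filtered subset would be chosen by a token budget rather than a fixed pair count; both details are left out of this sketch.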