Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions
- Muhammad N. ElNokrashy
- Amr Hendy
- Mohamed Abdelghaffar
- M. Afify
- Ahmed Tawfik
- H. Awadalla
arXiv, Vol. abs/2011.07933
This paper describes our submission to the WMT20 sentence filtering task. We combine scores from a custom LASER model built for each source language, a classifier built to distinguish positive and negative pairs, and the original scores provided with the task. For the mBART setup provided by the organizers, our method shows 7% and 5% relative improvement over the baseline in sacreBLEU score on the test set for Pashto and Khmer, respectively.
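The abstract does not say how the three score sources are combined, so the following is only a minimal sketch of one plausible scheme: min-max normalize each score list, then take a weighted average per sentence pair. The function names, the weights, and the normalization choice are all assumptions made for illustration, not the authors' method. A small helper also shows how a relative improvement such as the reported 7% would be computed from a baseline and a system score.

```python
from typing import Dict, List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_scores(
    score_sets: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of normalized scores, one value per sentence pair.

    Assumption: every score list is aligned to the same sentence pairs,
    and `weights` has an entry for every key in `score_sets`.
    """
    normalized = {name: min_max_normalize(vals) for name, vals in score_sets.items()}
    total_weight = sum(weights[name] for name in normalized)
    n_pairs = len(next(iter(normalized.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in normalized) / total_weight
        for i in range(n_pairs)
    ]


def relative_improvement(baseline: float, system: float) -> float:
    """Relative gain over the baseline, e.g. 0.07 for a 7% improvement."""
    return (system - baseline) / baseline
```

Under this sketch, the highest-scoring pairs would be kept for training. The hypothetical keys in `score_sets` would be something like `"laser"`, `"classifier"`, and `"original"`, and the weights would need to be tuned on held-out data.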