Abstract

This paper explores a way to automatically learn post-editing fixes for raw machine translation (MT) output by combining two different types of statistical machine translation (SMT) systems in a linear chain. Our proposed system (which we call a chained system) consists of two SMT systems: (i) a syntax-based SMT system and (ii) a phrase-based SMT system (Koehn, 2004). We first translate the source sentences of the bi-text training data into the target language using the syntax-based SMT system. This gives us monolingual parallel data consisting of the raw MT outputs and their corresponding human translations. We then train a phrase-based SMT system on this monolingual parallel corpus. Our system is thus a chain of a syntax-based SMT system and a phrase-based SMT system. The benefit of the chained system is that post-editing fixes are learned automatically by the phrase-based SMT system (Simard et al., 2007a/b). We investigated the impact of the chained system on the initial SMT system in terms of BLEU, using typologically different language pairs. The results of our experiments strongly indicate that the second part of the chained system can robustly compensate for the weaknesses of the initial SMT system by providing human-like fixes.
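
The data flow of the chained system can be illustrated with a minimal sketch. It is not the implementation used in the paper: both stages are toy stand-ins, and every name in it (`build_post_editing_corpus`, `train_toy_post_editor`, `chain`) as well as the example sentence pair is hypothetical. In particular, the second stage here is an exact-match lookup table, whereas the actual system trains a phrase-based SMT model on the same kind of corpus.

```python
from typing import Callable, Dict, List, Tuple

# A translator is modeled as any function from a sentence to a sentence.
Translator = Callable[[str], str]


def build_post_editing_corpus(
    bitext: List[Tuple[str, str]], first_stage: Translator
) -> List[Tuple[str, str]]:
    """Pair each raw MT output with its human reference translation.

    The result is monolingual parallel data: both sides are in the
    target language.
    """
    return [(first_stage(source), reference) for source, reference in bitext]


def train_toy_post_editor(corpus: List[Tuple[str, str]]) -> Translator:
    """Toy stand-in for the second stage (assumption, not phrase-based SMT).

    It memorizes whole-sentence fixes and leaves unseen input unchanged.
    """
    fixes: Dict[str, str] = dict(corpus)
    return lambda mt_output: fixes.get(mt_output, mt_output)


def chain(first_stage: Translator, second_stage: Translator) -> Translator:
    """Compose the two stages: source -> raw MT output -> post-edited output."""
    return lambda source: second_stage(first_stage(source))


# Invented example data; the first stage is a lookup table standing in
# for the syntax-based SMT system.
raw_mt = {"das haus ist klein": "the house small is"}
first_stage: Translator = lambda source: raw_mt.get(source, source)

bitext = [("das haus ist klein", "the house is small")]
corpus = build_post_editing_corpus(bitext, first_stage)
chained = chain(first_stage, train_toy_post_editor(corpus))

print(chained("das haus ist klein"))
```

The point of the sketch is only the ordering of the steps: the post-editing corpus is produced by running the first system over its own training bi-text, and the second system never sees the source language.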
