Dramatically Reducing Training Data Size through Vocabulary Saturation

Will Lewis; Sauleh Eetemadi

Proceedings of the Eighth Workshop on Statistical Machine Translation, ACL 2013 | August 2013

Published by ACL

Our field has seen significant improvements in the quality of machine translation systems over the past several years. The single biggest factor in this improvement has been the accumulation of ever larger stores of data. However, we now find ourselves the victims of our own success: it has become increasingly difficult to train on such large sets of data because of limitations in memory, processing power, and, ultimately, speed (i.e., turning data into models takes an inordinate amount of time). Some teams have dealt with this by focusing on data cleaning to arrive at smaller datasets (Denkowski et al., 2012a; Rarrick et al., 2011), on “domain adaptation” to arrive at data more suited to the task at hand (Moore and Lewis, 2010; Axelrod et al., 2011), or on data reduction itself, keeping only as much data as is needed for building models (e.g., Eck et al., 2005). This paper focuses on techniques related to the last of these efforts. We have developed a very simple n-gram counting method that reduces the size of data sets dramatically, by as much as 90%, and is applicable independent of specific dev and test data. At the same time, it reduces model sizes and improves training times, and, because it attempts to preserve contexts for all n-grams in a corpus, the cost in quality is minimal (as measured by BLEU). Further, unlike other methods created specifically for data reduction that have similar effects on the data, our method scales to very large data, up to tens to hundreds of millions of parallel sentences.
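
The abstract does not spell out the counting procedure, so the following is only a minimal sketch of what an n-gram saturation filter could look like, not the authors' implementation. It assumes a single pass over one side of the corpus, a sentence is kept only if it contains at least one n-gram that has been seen fewer than `threshold` times in the sentences kept so far, and counts are updated only for kept sentences. The function names, the `max_n` and `threshold` parameters, and the whitespace tokenization are all hypothetical choices made for illustration.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Yield every n-gram of length 1..max_n as a tuple of tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def saturation_filter(sentences, max_n=2, threshold=1):
    """Keep a sentence only if it still contributes an unsaturated n-gram.

    Assumption (not from the paper's abstract): an n-gram is "saturated"
    once it has appeared in `threshold` kept sentences.
    """
    counts = Counter()  # n-gram -> number of kept sentences containing it
    kept = []
    for sentence in sentences:
        grams = set(ngrams(sentence.split(), max_n))
        # A Counter returns 0 for unseen keys, so new n-grams always qualify.
        if any(counts[g] < threshold for g in grams):
            kept.append(sentence)
            counts.update(grams)
    return kept


corpus = ["the cat sat", "the cat sat", "the dog sat", "the cat"]
reduced = saturation_filter(corpus, max_n=2, threshold=1)
```

In this toy corpus the exact duplicate and the final sentence add no new unigram or bigram, so they are dropped, while every n-gram in the original data still appears at least once in the reduced set. For parallel data, the same test would presumably be applied to sentence pairs rather than to single sentences; that extension is not shown here.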