Hierarchical Phrase-Based Translation with Suffix Arrays

May 21, 2007
Adam Lopez | University of Maryland

A major engineering challenge in statistical machine translation systems is the efficient representation of extremely large translation rulesets. In phrase-based models, this problem can be addressed by storing the training data in memory and using a suffix array as an efficient index to quickly look up and extract rules on the fly. Hierarchical phrase-based translation introduces the added wrinkle of source phrases with gaps. Lookup algorithms used for contiguous phrases no longer apply, and the best approximate pattern matching algorithms are much too slow, taking several minutes per sentence. I describe new lookup algorithms for hierarchical phrase-based translation that reduce the empirical computation time by nearly two orders of magnitude, making on-the-fly lookup feasible for source phrases with gaps. I will also discuss some novel applications of these algorithms.
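
For readers unfamiliar with the contiguous case mentioned in the abstract, the following is a minimal sketch of suffix-array lookup over a tokenized corpus. It covers only contiguous phrases and is not the gapped-phrase algorithm the talk presents. The function names (`build_suffix_array`, `find_occurrences`) and the toy corpus are hypothetical, and the naive sort-based construction is chosen for brevity rather than speed.

```python
from typing import List


def build_suffix_array(tokens: List[str]) -> List[int]:
    """Return suffix start positions sorted by the suffix they begin.

    Naive construction (assumption: fine for a toy corpus; large corpora
    would need a dedicated suffix-array construction algorithm).
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens: List[str], suffix_array: List[int],
                     phrase: List[str]) -> List[int]:
    """Return all corpus positions where the contiguous phrase occurs."""
    m = len(phrase)

    def prefix(rank: int) -> List[str]:
        # First m tokens of the suffix at the given rank in the array.
        start = suffix_array[rank]
        return tokens[start:start + m]

    # Binary search for the first suffix whose m-token prefix is >= phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Binary search for the first suffix whose m-token prefix is > phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid

    # All matching suffixes are adjacent in the array: ranks first..lo-1.
    return sorted(suffix_array[first:lo])


corpus = "the cat sat on the mat and the cat ran".split()
sa = build_suffix_array(corpus)
matches = find_occurrences(corpus, sa, ["the", "cat"])
```

Because every suffix beginning with the query phrase sorts into one contiguous block of the array, two binary searches locate all occurrences. A source phrase with a gap does not correspond to a single such block, which is why this lookup does not carry over directly to the hierarchical setting.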

Speaker Details

Adam Lopez is a Ph.D. candidate in computer science at the University of Maryland, expecting to graduate in August 2007. His dissertation work focuses on statistical machine translation and his interests are in large-scale natural language processing and algorithms. Prior to graduate school, he worked as a software engineer at the IBM Corporation, after receiving his bachelor’s degree in computer science from Duke University.