Efficient Machine Translation with Model Pruning and Quantization

  • Maximiliana Behnke,
  • Nikolay Bogoychev,
  • Alham Fikri Aji,
  • Kenneth Heafield,
  • Graeme Nail,
  • Qianqian Zhu,
  • Svetlana Tchistiakova,
  • J. V. D. Linde,
  • Pinzhen Chen,
  • Sidharth Kashyap,
  • Roman Grundkiewicz

Proceedings of the Sixth Conference on Machine Translation

Published by the Association for Computational Linguistics

We participated in all tracks of the WMT 2021 efficient machine translation task: single-core CPU, multi-core CPU, and GPU hardware, under both throughput and latency conditions. Our submissions combine several efficiency strategies: knowledge distillation, a simpler simple recurrent unit (SSRU) decoder with one or two layers, lexical shortlists, smaller numerical formats, and pruning. For the CPU tracks, we used quantized 8-bit models. For the GPU track, we experimented with FP16 and 8-bit integers on tensor cores. Some of our submissions optimize for size by using 4-bit log quantization and omitting the lexical shortlist. We have extended pruning to more parts of the network, emphasizing component- and block-level pruning, which, unlike coefficient-wise pruning, actually improves speed.
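The abstract names two techniques concrete enough to sketch. First, a minimal NumPy sketch of 4-bit log quantization, in the spirit of the size-optimized submissions: each weight is snapped to a signed power of two drawn from a small set anchored at the tensor's largest magnitude, so every code fits in 4 bits. This is an illustrative reconstruction under assumed choices (nearest power of two, per-tensor anchoring), not the Marian implementation used for the submissions.

```python
import numpy as np

def log_quantize(w, bits=4):
    """Toy log quantization: snap each weight to a signed power of two.

    Illustrative sketch only, not the submissions' Marian code. One bit holds
    the sign; the remaining bits index a small set of exponents anchored at
    the largest magnitude in the tensor.
    """
    levels = 2 ** (bits - 1)                      # 8 magnitude levels for 4 bits
    max_exp = np.floor(np.log2(np.abs(w).max()))
    exps = max_exp - np.arange(levels)            # descending integer exponents
    logs = np.log2(np.maximum(np.abs(w), 1e-12))  # work in log space
    idx = np.abs(logs[..., None] - exps).argmin(axis=-1)
    return np.sign(w) * 2.0 ** exps[idx]

w = np.random.randn(4, 6).astype(np.float32)
print(log_quantize(w))
```

Second, a rough sketch of block-level pruning. The scoring rule here (block L1 magnitude) is an assumption for illustration; the point is that zeroing whole blocks keeps the surviving computation dense and regular, which is why it can yield real speed-ups where scattered coefficient-wise zeros do not.

```python
import numpy as np

def block_prune(w, block=(8, 8), keep=0.5):
    """Zero out entire blocks of a weight matrix, keeping the highest-scoring
    fraction. Illustrative magnitude criterion, not the paper's exact method."""
    rows, cols = w.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0
    blocks = w.reshape(rows // br, br, cols // bc, bc)
    scores = np.abs(blocks).sum(axis=(1, 3))      # L1 norm per block
    cutoff = np.quantile(scores, 1.0 - keep)      # threshold for survival
    mask = (scores >= cutoff)[:, None, :, None]   # broadcast over block contents
    return (blocks * mask).reshape(rows, cols)

w = np.random.randn(64, 64).astype(np.float32)
print((block_prune(w) != 0).mean())               # roughly the `keep` fraction survives
```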