Efficient Machine Translation with Model Pruning and Quantization

  • Maximiliana Behnke,
  • Nikolay Bogoychev,
  • Alham Fikri Aji,
  • Kenneth Heafield,
  • Graeme Nail,
  • Qianqian Zhu,
  • Svetlana Tchistiakova,
  • J. V. D. Linde,
  • Pinzhen Chen,
  • Sidharth Kashyap,
  • Roman Grundkiewicz

Proceedings of the Sixth Conference on Machine Translation

Published by the Association for Computational Linguistics

We participated in all tracks of the WMT 2021 efficient machine translation task: single-core CPU, multi-core CPU, and GPU hardware, under both throughput and latency conditions. Our submissions combine several efficiency strategies: knowledge distillation, a simpler simple recurrent unit (SSRU) decoder with one or two layers, lexical shortlists, smaller numerical formats, and pruning. For the CPU tracks, we used quantized 8-bit models. For the GPU track, we experimented with FP16 and 8-bit integers on tensor cores. Some of our submissions optimize for size by using 4-bit log quantization and omitting the lexical shortlist. We have extended pruning to more parts of the network, emphasizing component- and block-level pruning, which, unlike coefficient-wise pruning, actually improves speed.
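The abstract names two techniques concrete enough to sketch. First, a minimal NumPy sketch of 4-bit log quantization, in the spirit of the size-optimized submissions: each weight is snapped to a signed power of two drawn from a small set anchored at the tensor's largest magnitude, so every code fits in 4 bits. This is an illustrative reconstruction under assumed choices (nearest power of two, per-tensor anchoring), not the Marian implementation used for the submissions.

```python
import numpy as np

def log_quantize(w, bits=4):
    """Toy log quantization: snap each weight to a signed power of two.

    Illustrative sketch only, not the submissions' Marian code. One bit holds
    the sign; the remaining bits index a small set of exponents anchored at
    the largest magnitude in the tensor.
    """
    levels = 2 ** (bits - 1)                      # 8 magnitude levels for 4 bits
    max_exp = np.floor(np.log2(np.abs(w).max()))
    exps = max_exp - np.arange(levels)            # descending integer exponents
    logs = np.log2(np.maximum(np.abs(w), 1e-12))  # work in log space
    idx = np.abs(logs[..., None] - exps).argmin(axis=-1)
    return np.sign(w) * 2.0 ** exps[idx]

w = np.random.randn(4, 6).astype(np.float32)
print(log_quantize(w))
```

Second, a rough sketch of block-level pruning. The scoring rule here (block L1 magnitude) is an assumption for illustration; the point is that zeroing whole blocks keeps the surviving computation dense and regular, which is why it can yield real speed-ups where scattered coefficient-wise zeros do not.

```python
import numpy as np

def block_prune(w, block=(8, 8), keep=0.5):
    """Zero out entire blocks of a weight matrix, keeping the highest-scoring
    fraction. Illustrative magnitude criterion, not the paper's exact method."""
    rows, cols = w.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0
    blocks = w.reshape(rows // br, br, cols // bc, bc)
    scores = np.abs(blocks).sum(axis=(1, 3))      # L1 norm per block
    cutoff = np.quantile(scores, 1.0 - keep)      # threshold for survival
    mask = (scores >= cutoff)[:, None, :, None]   # broadcast over block contents
    return (blocks * mask).reshape(rows, cols)

w = np.random.randn(64, 64).astype(np.float32)
print((block_prune(w) != 0).mean())               # roughly the `keep` fraction survives
```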