InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training

arXiv

In this work, we formulate cross-lingual language model pre-training as maximizing mutual information between multilingual-multi-granularity texts. This unified view helps us better understand existing methods for learning cross-lingual representations. More importantly, the information-theoretic framework inspires us to propose a new pre-training task based on contrastive learning. Given a bilingual sentence pair, we regard the two sentences as two views of the same meaning and encourage their encoded representations to be more similar to each other than to those of negative examples. By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models. Experimental results on several benchmarks show that our approach achieves considerably better performance. The code and pre-trained models are available on GitHub.
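
The contrastive objective described in the abstract can be illustrated with a minimal sketch of an InfoNCE-style loss. This is not the authors' implementation: the function name `info_nce_loss`, the `temperature` value, the use of cosine similarity, and the choice of in-batch negatives are all assumptions made for illustration. Row i of `src` and row i of `tgt` stand for the encoded representations of a translation pair, and every other row of `tgt` is treated as a negative example for `src[i]`.

```python
import numpy as np


def info_nce_loss(src, tgt, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of sentence-pair embeddings.

    src, tgt: arrays of shape (batch, dim). src[i] and tgt[i] are assumed to
    encode a translation pair; tgt[j] with j != i acts as a negative for src[i].
    The temperature of 0.1 is an illustrative default, not a value from the paper.
    """
    # L2-normalize so the dot product below is cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)

    # Pairwise similarity matrix, shape (batch, batch).
    logits = src @ tgt.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal; average its negative log-probability.
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this quantity pushes each translation pair closer together than the mismatched pairs in the same batch, which is the behavior the abstract describes. A real pre-training setup would compute the embeddings with the encoder being trained and backpropagate through them; plain NumPy is used here only to keep the sketch self-contained.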

Publication Downloads

UniLM – Unified Language Model Pre-training

October 1, 2019

We develop pre-trained models for natural language understanding (NLU) and natural language generation (NLG) tasks. New (October 1, 2019): UniLM v1 release. UniLM v1 (September 30, 2019): the code and pre-trained models for the NeurIPS 2019 paper "Unified Language Model Pre-training for Natural Language Understanding and Generation". UniLM v1 achieves new state-of-the-art results on NLG benchmarks, especially sequence-to-sequence generation, including abstractive summarization (the Gigaword and CNN/DM datasets) and question generation (the SQuAD QG dataset). UniLM v2: the new pre-training protocol and implementation scheme (coming soon).
