Fast Pretraining

Unsupervised language pre-training has been widely adopted across machine learning applications. Because the pre-training task requires no human labeling effort, massive Web-scale corpora can be used to train models with billions of parameters, which makes pre-training computationally expensive. We tackle the efficiency of language pre-training by analyzing and rethinking several dimensions of existing methods, including data utilization, positional encoding, layer normalization, and self-attention distributions. Our proposed methods bring significant accelerations to language pre-training tasks.
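To make one of these dimensions concrete, the sketch below illustrates layer-normalization placement inside a Transformer encoder block, using the pre-LN arrangement (normalizing before each sub-layer) that is commonly studied as a way to stabilize and speed up large-scale pre-training. This is a minimal illustrative example, not the project's actual code; the class name, hyperparameters, and the choice of pre-LN as the variant shown are assumptions for illustration.

```python
# Minimal sketch: a pre-LN Transformer encoder block, one example of
# rethinking layer-normalization placement for pre-training efficiency.
# Hyperparameters and names are illustrative, not the project's own.
import torch
import torch.nn as nn


class PreLNTransformerBlock(nn.Module):
    """Encoder block that applies LayerNorm *before* the attention and
    feed-forward sub-layers, with residual connections around each."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_ff = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize before self-attention, then add the residual.
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Normalize before the feed-forward sub-layer, then add the residual.
        x = x + self.ff(self.ln_ff(x))
        return x


if __name__ == "__main__":
    block = PreLNTransformerBlock()
    tokens = torch.randn(2, 16, 768)  # (batch, sequence length, hidden size)
    print(block(tokens).shape)        # torch.Size([2, 16, 768])
```

In contrast, the original post-LN design normalizes after each residual addition; the pre-LN variant shown here is one of several placements that have been analyzed for their effect on training stability and speed.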

People

Tie-Yan Liu

Distinguished Scientist, Microsoft Research AI4Science

Shuxin Zheng

Principal Researcher