Curriculum Data Augmentation for Highly Multiclass Text Classification

  • Jason Wei,
  • Chengyu Huang,
  • Soroush Vosoughi,
  • Yu Cheng

19th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021)

This paper explores data augmentation—a technique particularly suitable for training with limited data—for highly multiclass text classification tasks, which have a large number of output classes. On four diverse highly multiclass tasks, we find that well-known data augmentation techniques (Sennrich et al., 2016b; Wang et al., 2018; Wei and Zou, 2019) can improve performance by up to 3.0% on average. To further boost performance, we present a simple training strategy called curriculum data augmentation, which leverages curriculum learning by first training on only original examples and then introducing augmented data as training progresses. We explore a two-stage and a gradual schedule, and find that, compared with standard single-stage training, curriculum data augmentation improves performance, trains faster, and maintains robustness to high augmentation temperatures (strengths).
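The two-stage schedule described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the `augment` function here is a placeholder random word-swap (standing in for techniques like EDA or back-translation), and the names `curriculum_batches` and `warmup_epochs` are invented for this sketch.

```python
import random

def augment(text, temperature=0.1):
    """Placeholder augmentation: randomly swap a fraction of words,
    with the fraction controlled by the augmentation temperature.
    (Real systems would use synonym replacement, back-translation, etc.)"""
    words = text.split()
    n_swaps = max(1, int(temperature * len(words)))
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def curriculum_batches(originals, n_epochs, warmup_epochs, temperature=0.1):
    """Two-stage curriculum: train on original examples only for the first
    `warmup_epochs`, then on originals plus augmented copies afterwards."""
    for epoch in range(n_epochs):
        batch = list(originals)
        if epoch >= warmup_epochs:  # stage 2: introduce augmented data
            batch += [augment(x, temperature) for x in originals]
        random.shuffle(batch)
        yield epoch, batch
```

A gradual schedule would instead ramp the temperature (or the augmented fraction) from zero upward across epochs rather than switching it on at a single point.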