Curriculum Data Augmentation for Highly Multiclass Text Classification
- Jason Wei,
- Chengyu Huang,
- Soroush Vosoughi,
- Yu Cheng
19th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021)
This paper explores data augmentation—a technique particularly suitable for training with limited data—for highly multiclass text classification tasks, which have a large number of output classes. On four diverse highly multiclass tasks, we find that well-known data augmentation techniques (Sennrich et al., 2016b; Wang et al., 2018; Wei and Zou, 2019) can improve performance by up to 3.0% on average. To further boost performance, we present a simple training strategy called curriculum data augmentation, which leverages curriculum learning by first training on only original examples and then introducing augmented data as training progresses. We explore a two-stage and a gradual schedule, and find that, compared with standard single-stage training, curriculum data augmentation improves performance, trains faster, and maintains robustness to high augmentation temperatures (strengths).
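To make the two-stage schedule concrete, the sketch below outlines one way such a curriculum could be implemented. It is a minimal illustration, not the authors' released code: the names `augment`, `train_step`, `num_aug`, and `temperature` are hypothetical, and the augmentation function is a stand-in for any text augmentation technique (e.g., EDA-style token perturbation).

```python
import random

def augment(text, temperature=0.3):
    # Placeholder augmentation (hypothetical): randomly drop a fraction of
    # tokens proportional to the augmentation temperature (strength).
    tokens = text.split()
    kept = [t for t in tokens if random.random() > temperature] or tokens
    return " ".join(kept)

def two_stage_curriculum(train_step, original_data,
                         epochs_stage1, epochs_stage2,
                         num_aug=4, temperature=0.3):
    """Sketch of two-stage curriculum data augmentation.

    Stage 1: train on original examples only.
    Stage 2: add augmented copies and continue training.
    """
    # Stage 1: original examples only.
    for _ in range(epochs_stage1):
        random.shuffle(original_data)
        for text, label in original_data:
            train_step(text, label)

    # Stage 2: original plus augmented examples.
    augmented = [(augment(t, temperature), y)
                 for t, y in original_data
                 for _ in range(num_aug)]
    combined = original_data + augmented
    for _ in range(epochs_stage2):
        random.shuffle(combined)
        for text, label in combined:
            train_step(text, label)
```

A gradual schedule would differ only in how augmented data enters training: instead of switching at a fixed epoch, the augmentation temperature (or the fraction of augmented examples) would be increased incrementally as training progresses.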