Understanding Knowledge Distillation in Neural Sequence Generation
Sequence-level knowledge distillation (KD) — learning a student model on targets decoded from a pre-trained teacher model — has been widely used in sequence generation applications (e.g., model compression, non-autoregressive translation (NAT), and low-resource translation).…
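To make the setup concrete, below is a minimal sketch of sequence-level KD, not the exact pipeline from the talk: the teacher beam-decodes pseudo-targets for the source sentences, and the student is trained with ordinary cross-entropy on the resulting (source, teacher-output) pairs. The model names are placeholders; swap in any Hugging Face seq2seq teacher/student pair.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoints; substitute a real teacher/student pair.
tokenizer = AutoTokenizer.from_pretrained("teacher-model-name")
teacher = AutoModelForSeq2SeqLM.from_pretrained("teacher-model-name")
student = AutoModelForSeq2SeqLM.from_pretrained("student-model-name")

teacher.eval()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(source_sentences):
    # 1) Teacher decodes the distilled targets. Sequence-level KD trains on
    #    the teacher's beam-search output, not on its soft token distributions.
    inputs = tokenizer(source_sentences, return_tensors="pt", padding=True)
    with torch.no_grad():
        pseudo_targets = teacher.generate(**inputs, num_beams=5, max_length=128)

    # 2) Student trains on the teacher's output with standard cross-entropy.
    #    Padding positions are set to -100 so they are ignored by the loss;
    #    depending on the model family, the leading decoder-start token may
    #    also need to be stripped from the labels.
    labels = pseudo_targets.clone()
    labels[labels == tokenizer.pad_token_id] = -100
    loss = student(input_ids=inputs["input_ids"],
                   attention_mask=inputs["attention_mask"],
                   labels=labels).loss

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In this setting, the student never sees the reference translations during distillation; the appeal for NAT in particular is that the teacher's decoded output is less multi-modal than real data, which simplifies the student's learning problem.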