Understanding Knowledge Distillation in Neural Sequence Generation
- Akiko Eriguchi, Jiatao Gu | Microsoft, Facebook AI Research
Sequence-level knowledge distillation (KD) — learning a student model with targets decoded from a pre-trained teacher model — has been widely used in sequence generation applications (e.g., model compression, non-autoregressive translation (NAT), and low-resource translation). However, the underlying reasons for this success remain unclear. In this talk, we aim to better understand KD in two scenarios: (1) learning a weaker student from a strong teacher model while keeping the same parallel data used to train the teacher; and (2) learning a student from a teacher model of equal size, with targets generated from additional monolingual data.
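To make the setup concrete, here is a minimal sketch of one sequence-level KD training step: the frozen teacher decodes pseudo-targets for a source batch, and the student is trained with standard cross-entropy on those outputs. It assumes a Hugging Face-style seq2seq interface; the function name, beam size, and length limit are illustrative assumptions, not the configuration used in the talk.

```python
import torch


def seq_kd_step(teacher, student, tokenizer, src_sentences, optimizer, device="cpu"):
    """One sequence-level KD step (illustrative sketch):
    1) the frozen teacher decodes pseudo-targets for the source batch,
    2) the student is trained with cross-entropy on those pseudo-targets."""
    teacher.eval()
    student.train()

    batch = tokenizer(src_sentences, return_tensors="pt", padding=True).to(device)

    # Step 1: decode targets from the teacher (beam search is typical for seq-KD).
    with torch.no_grad():
        pseudo_targets = teacher.generate(**batch, num_beams=5, max_new_tokens=64)

    # Step 2: standard MLE training of the student on (source, teacher output) pairs.
    labels = pseudo_targets.clone()
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    loss = student(**batch, labels=labels).loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In scenario (1), `src_sentences` would come from the teacher's original parallel training data; in scenario (2), from additional monolingual source text.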