Three mysteries in deep learning: Ensemble, knowledge distillation, and self-distillation
Under now-standard techniques, such as over-parameterization, batch-normalization, and adding residual links, "modern age" neural network training (at least for image classification tasks and many others) is usually quite stable. Using standard neural network architectures and training algorithms (typically SGD with momentum), the learned models perform consistently well,…
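As a concrete illustration of the "standard" setup described above, here is a minimal sketch, assuming PyTorch and torchvision (illustrative choices, not prescribed by the article): an over-parameterized ResNet, which already contains residual links and batch-normalization layers, trained with SGD plus momentum on an image-classification task (CIFAR-10 is used here purely as a stand-in).

```python
# Minimal sketch of a "standard" training recipe: ResNet (residual links +
# batch norm) trained with SGD + momentum. Dataset and hyperparameters are
# illustrative assumptions, not values from the article.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

# ResNet-18 already includes residual connections and batch normalization.
model = torchvision.models.resnet18(num_classes=10).to(device)

# SGD with momentum, the training algorithm mentioned in the text.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

# CIFAR-10 as a stand-in image-classification task (an assumption).
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=T.Compose([T.RandomCrop(32, padding=4),
                         T.RandomHorizontalFlip(),
                         T.ToTensor()]),
)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model.train()
for epoch in range(10):  # a short run; real experiments train much longer
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last-batch loss {loss.item():.4f}")
```

Runs of this kind, repeated with different random initializations or data orderings, are what the article refers to when it says training is "usually quite stable."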