{"id":717850,"date":"2021-01-19T13:26:10","date_gmt":"2021-01-19T21:26:10","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=717850"},"modified":"2021-03-08T13:04:49","modified_gmt":"2021-03-08T21:04:49","slug":"three-mysteries-in-deep-learning-ensemble-knowledge-distillation-and-self-distillation","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/three-mysteries-in-deep-learning-ensemble-knowledge-distillation-and-self-distillation\/","title":{"rendered":"Three mysteries in deep learning: Ensemble, knowledge distillation, and self-distillation"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_Ensemble_No_logo-2.gif\" alt=\"\"\/><\/figure>\n\n\n\n<p>Under now-standard techniques, such as over-parameterization, batch-normalization, and adding residual links, \u201cmodern age\u201d neural network training\u2014at least for image classification tasks and many others\u2014is usually quite stable. Using standard neural network architectures and training algorithms (typically SGD with momentum), the learned models perform consistently well, not only in terms of training accuracy but even in test accuracy, regardless of which random initialization or random data order is used during the training. For instance, if one trains the same WideResNet-28-10 architecture on the CIFAR-100 dataset 10 times with different random seeds, the mean test accuracy is 81.51% while the standard deviation is only 0.16%.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--right\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/towards-understanding-ensemble-knowledge-distillation-and-self-distillation-in-deep-learning\/\" data-bi-cN=\"Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>In a new paper, \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/towards-understanding-ensemble-knowledge-distillation-and-self-distillation-in-deep-learning\/\">Towards Understanding Ensemble, Knowledge Distillation, and Self-Distillation in Deep Learning<\/a>,\u201d we focus on studying the discrepancy of neural networks during the training process that has arisen purely from randomizations. We ask the following questions: besides this small deviation in test accuracies, do the neural networks trained from different random initializations actually learn very different functions? If so, where does the discrepancy come from? How do we reduce such discrepancy and make the neural network more stable or even better? These questions turn out to be quite nontrivial, and they relate to the mysteries of three techniques widely used in deep learning.<\/p>\n\n\n\n<h2 id=\"three-of-the-mysteries-in-deep-learning\">Three of the mysteries in deep learning<\/h2>\n\n\n\n<p><strong>Mystery 1: Ensemble. <\/strong>The learned networks \\(F_1\\),\u2026\\(F_{10}\\) using different random seeds\u2014despite having very similar test performance\u2014are observed to associate with very different functions. Indeed, using a well-known technique called ensemble, which merely takes the unweighted average of the outputs of these independently trained networks, one can obtain a huge boost in test-time performance in many deep learning applications. (See Figure 1 below.) This implies the individual functions \\(F_1\\),\u2026\\(F_{10}\\) must be different. However, why does ensemble work with a sudden performance boost? Alternatively, if one directly trains (\\(F_1\\)+\u22ef+\\(F_{10}\\))\/10 altogether, why does the performance boost disappear?<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure1_esemble-blog-1024x388.jpg\" alt=\"F(1) F(1) and F(10) WideResNet-28-10 architecture shown, trained on the CIFAR-100 dataset. Text above the three seeds reads \"train the same WideResNet-28-10 architecture with 10 different seeds but same learning rate, same weight decay, same lr schedule, same momentum, same batch size...\" The three seeds output an unweighted average,  and each seed points to a blue box that reads \"test accuracies 81.51 plus or minus .16%. Four boxes bracketed with \"Why? Mystery 1\" read\" test accuracy less than or equal to 82%; train (f1+...+F1-)\/10 altogether with different seeds equals no performance boost; output test accuracy 84.87% (ensemble); train F1,...,F10 separately and take average equals performance boost. \" class=\"wp-image-717853\" width=\"900\" height=\"341\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure1_esemble-blog-1024x388.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure1_esemble-blog-300x114.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure1_esemble-blog-768x291.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure1_esemble-blog-16x6.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure1_esemble-blog.jpg 1501w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><figcaption>Figure 1: Ensemble gives a performance boost to test accuracies in deep learning applications, but such accuracy gains cannot be matched by training the average of the models directly.<\/figcaption><\/figure>\n\n\n\n<p><strong>Mystery 2:<\/strong> <strong>Knowledge distillation. <\/strong>While ensemble is great for improving test-time performance, it becomes 10 times slower during inference time (that is, test time): we need to compute the outputs of 10 neural networks instead of one. This is an issue when we deploy such models in a low-energy, mobile environment. To fix it, a seminal technique called <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1503.02531\">knowledge distillation<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> was proposed. That is, <em>knowledge distillation <\/em>simply trains another individual model to match the <em>output <\/em>of the ensemble. Here, the output of the ensemble (also called the <em>dark knowledge<\/em>) on a cat image may look like \u201c80% cat + 10% dog + 10% car,\u201d while the true training label is \u201c100% cat.\u201d (See Figure 2 below.)<\/p>\n\n\n\n<p>It turns out the so-trained individual model can, to great extent, match the test-time performance of a 10-times-bigger ensemble. However, this leads to more questions. Why does matching the outputs of the ensemble give us better test accuracy compared to matching the true labels? Moreover, can we perform ensemble learning over the models after knowledge distillation to further improve test accuracy?<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"532\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_Figre2_updated-1024x532.jpg\" alt=\"This image has three parts. The top third shows Figure 1, F1 through F10 seeds. Read Figure 1 for reference.\n\nThe middle third shows what happens if this process is adjusted to train a single model to \"match\" the output of ensemble. Lines drawn between output here and output of figure above indicate \"matching\" a soft-label such as 80%cat + 10% dog + 10% car instead of a hard-label \"cat\" from the dataset. Test accuracy 83.81% using knowledge distillation gives a performance boost. This is mystery two. \n\nThe bottom third shows what happens if this process is adjusted to train a single model to \"match\" the output of a single model of a different seed. Test accuracy of 83.56% using self-distillation generates a performance boot and indicates mystery three. \" class=\"wp-image-718183\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_Figre2_updated-1024x532.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_Figre2_updated-300x156.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_Figre2_updated-768x399.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_Figre2_updated-16x8.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_Figre2_updated.jpg 1208w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 2: Knowledge distillation and self-distillation also give performance boosts in deep learning.<\/figcaption><\/figure>\n\n\n\n<p><strong>Mystery 3: Self-distillation.<\/strong> Note that knowledge distillation at least intuitively makes sense: the teacher ensemble model has 84.8% test accuracy, so the student individual model can achieve 83.8%. The following phenomenon, called <em>self-distillation<\/em> (or \u201c<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1905.08094.pdf\">Be Your Own Teacher<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u201d), is completely astonishing\u2014by performing knowledge distillation against an individual model of the same architecture, test accuracy can also be improved. (See Figure 2 above.) Consider this: if training an individual model only gives 81.5% test accuracy, then how come \u201ctraining the same model again <em>using itself as the teacher<\/em>\u201d suddenly boosts the test accuracy consistently to 83.5%?<\/p>\n\n\n\n<h2 id=\"ensemble-of-neural-networks-versus-ensemble-of-feature-mappings\">Ensemble of neural networks versus ensemble of feature mappings<\/h2>\n\n\n\n<p>Most existing theories on ensemble only apply to the case where individual models are fundamentally different (for example, decision trees supported on different subsets of the variables) or trained over different datasets (such as bootstrapping). They cannot justify the aforementioned phenomenon in the deep learning world, where individually trained neural networks are of the <em>same <\/em>architecture and using the <em>same <\/em>training data\u2014their only difference comes from the randomness during training.<\/p>\n\n\n\n<p>Perhaps the existing theorem closest to matching ensemble in deep learning is the <em>ensemble of random feature mappings<\/em>. On one hand, combining multiple linear models of random (prescribed) features should improve test-time performance because it increases the number of features. On the other hand, in certain parameter regimes, neural network weights can stay very close to their initializations (known as the neural tangent kernel, or NTK, regime), and the resulting network is merely learning a linear function over prescribed feature mappings that are completely decided by the random initialization (see <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/a-convergence-theory-for-deep-learning-via-over-parameterization\/\">this work<\/a>). When combining these two, one may conjecture that ensemble in deep learning shares the principle of ensemble in random feature mappings. That leads us to the following question:<\/p>\n\n\n\n<p><strong>Does ensemble\/knowledge distillation work in the same way in deep learning compared to that in random feature mappings (namely, the NTK feature mappings)?<\/strong><\/p>\n\n\n\n<p><strong>Answer: not really, as evidenced by the experiment in Figure 3 below.<\/strong> This figure compares ensemble and knowledge distillation in deep learning versus that in a linear model over random feature mappings. Ensemble works in <em>both <\/em>cases. However, the accuracies in Figure 3 clearly show that they work for <em>completely different <\/em>reasons. Specifically:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Unlike in the deep learning case, the superior performance of ensemble in the random feature setting cannot be distilled to an individual model. For instance, in Figure 3, the neural tangent kernel (NTK) models\u2019 ensemble achieves 70.54% accuracy on the CIFAR-10 dataset, but after knowledge distillation, it goes down to 66.01%, even worse than the test accuracy of 66.68% on the individual model. <\/li><li>In deep learning, direct training the average of models (\\(F_1\\)+\u22ef+\\(F_{10}\\))\/10 offers no benefit compared to training one individual model \\(F_i\\); while in the random feature setting, training the average outperforms both individual models and their ensemble. For instance, in Figure 3, the NTK models\u2019 ensemble achieves 70.54% accuracy, but this is even worse than directly training the average of 10 models, which gives 72.86% accuracy.<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"368\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure3_emsemble-1024x368.png\" alt=\"Accuracies for ensemble, knowledge distillation, and self-distillation in different scenarios. Random Feature Mappings (CIFAR-10, CIFAR-100): percentage accuracy of directly training the average of 10 models, (72.86, 41.47); enemble accuracy over 10 (70.54, 38.32); individual model accuracies (66.68, 31.90); knowledge and self-distillation (66.01, 61.92 and 31.38%, 27.64%). \n\nDeep learning: ensemble accuracy over 10 (97.20, 84.69); knowledge and self-distillation accuracies (97.22, 97.13 and 83.81 and 83.56); accuracy of directly training the average of 10 models (96.46, 81.83); individual model accuracies (96.70 plus or minus .21, 81.51 plus or minus .16). See paper for more examples. \" class=\"wp-image-717859\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure3_emsemble-1024x368.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure3_emsemble-300x108.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure3_emsemble-768x276.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure3_emsemble-16x6.png 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Figure3_emsemble.png 1256w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 3: Ensemble works in random feature mappings (but for a completely different reason from that in deep learning), and knowledge distillation does not work in random feature mapping.<\/figcaption><\/figure>\n\n\n\n<p>The reason for this is that the neural network is performing <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/backward-feature-correction-how-deep-learning-performs-deep-learning\/\"><em>hierarchical feature learning<\/em><\/a>\u2014each individual model \\(F_i\\), despite having different random initializations, is still capable of learning the same set of features as the others. Thus, their average offers almost no additional capacity compared to an individual network. However, in the linear setting, each \\(F_i\\) uses a different set of prescribed features; so, although combining these features (using either ensemble or direct training average) does offer an advantage, they just cannot by distilled into an individual model due to the scarcity of features.<\/p>\n\n\n\n<h2 id=\"ensemble-versus-reducing-variance-of-individual-models\">Ensemble versus reducing variance of individual models<\/h2>\n\n\n\n<p>Besides ensemble of random features, one might also conjecture that, due to the high complexity of the neural network, each individual model \\(F_i\\) might learn a function \\(F_i\\) \\((x)=y+\u03be_i\\), where \\(\u03be_i\\) is some noise that depends on the randomness used during training. Classical statistics suggests that if all the \\(\u03be_i\\)&#8216;s are roughly independent, then averaging them greatly reduces the amount of noise. Thus,<\/p>\n\n\n\n<p><strong>Can \u201censemble reduces the variance\u201d be the reason for ensemble\u2019s performance boost?<\/strong><\/p>\n\n\n\n<p><strong>Answer<\/strong>: our evidence shows that this hypothesis of reducing variance is very questionable in the context of deep learning:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Ensemble does not increase test accuracy forever: ensemble over 100 individual models typically makes no difference when compared to ensemble over 10 individual models. So, the average of 100 \\(\u03be_i\\)&#8216;s does not reduce the variance anymore comparing to 10 \\(\u03be_i\\)&#8216;s\u2014indicating that these \\(\u03be_i\\)&#8216;s are (1) potentially not independent, and\/or (2) can be biased so the mean is not zero. In the event of (1), it is difficult to argue how much error can be reduced by averaging these \\(\u03be_i\\)&#8216;s<\/li><\/ul>\n\n\n\n<ul class=\"wp-block-list\"><li>Even if one wishes to accept the idealistic belief that (1) does not occur so all these \\(\u03be_i\\)&#8216;s are just biased, or in symbols, \\(F_i (x) = y +\u03be+\u03be_i\\)&#8216; where \\(\u03be\\) is a common error and \\(\u03be_i\\)&#8216; is an individual, independent error, then, why does knowledge distillation work? After ensemble we expect the network to output close to y+\u03be with a common bias \\(\u03be\\). Then, why is this output with error \u03be (also known as dark knowledge) better than the original true label for training?<\/li><\/ul>\n\n\n\n<ul class=\"wp-block-list\"><li>In Figure 4, we see that ensemble learning of neural networks does not always improve test accuracy\u2014at least when the input is Gaussian-like. In other words, \u201caveraging these \u201d does not lead to any accuracy gain in these networks. Therefore, we need to understand ensemble in deep learning with more care than a generic statement such as \u201creducing variance.\u201d<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"452\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_fig4_updated-1024x452.jpg\" alt=\"When inputs are {Gaussian or mixture of Gaussians} and generated through {uniform or rejection} sampling, \n\nOR\n\nWhen labels are generated by {linear or fully connected or convolutional or residual} networks {with or without} margin and {with or without} noise,\n\nour experiments suggest that ensemble accuracy is less than or equal to individual model accuracies.\" class=\"wp-image-718189\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_fig4_updated-1024x452.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_fig4_updated-300x133.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_fig4_updated-768x339.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_fig4_updated-16x7.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/Ensemble_fig4_updated.jpg 1218w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 4: When inputs are Gaussian-like, experiments suggest that ensemble does not improve test accuracy.<\/figcaption><\/figure>\n\n\n\n<h2 id=\"multi-view-data-a-new-approach-to-justify-ensemble-in-deep-learning\">Multi-view data: A new approach to justify ensemble in deep learning<\/h2>\n\n\n\n<p>Since ensemble is less likely to work under unstructured random inputs (see Figure 4), we have to look at special structure in the data to understand it properly.<\/p>\n\n\n\n<p>In <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/towards-understanding-ensemble-knowledge-distillation-and-self-distillation-in-deep-learning\/\">our new work<\/a>, we propose studying a common structure that can be found in many of the datasets where deep learning excels. In vision datasets in particular, the object can usually be classified using <em>multiple views<\/em>. For example, a car image can be classified as a car by looking at the headlights, the wheels, or the windows. For a typical placement of a car in images, we can observe all these features, and it suffices to use one of the features to classify it as a car. However, there are some car images taken from a particular angle, where one or more of these features are missing. For example, an image of a car facing forward might be missing the wheel feature. We give real-life examples in Figure 5.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"284\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/figure5-ensemble-1024x284.png\" alt=\"Images of various representations of cars and horses. Text reads\" ResNet-34 learns three features (views) of a car: (1) front wheel (2) front window (3) side window. ResNet-34 learns three features (views) of a horse: (1) tail (2) legs (3) head. \" class=\"wp-image-717880\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/figure5-ensemble-1024x284.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/figure5-ensemble-300x83.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/figure5-ensemble-768x213.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/figure5-ensemble-16x4.png 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/figure5-ensemble.png 1116w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 5: Visualization of some channels in layer 23 of ResNet-34 trained on CIFAR-10<\/figcaption><\/figure>\n\n\n\n<p>We refer to this structure as \u201cmulti-view,\u201d where each class of the data has multiple view features. In most of the data, almost all of the view features will show up, but in some data, some view features might be missing. (More broadly speaking, the \u201cmulti-view\u201d structure shows up not only in the input pixel space but also in the intermediate layers; we refer interested readers to Figure 6 of <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/towards-understanding-ensemble-knowledge-distillation-and-self-distillation-in-deep-learning\/\">our paper<\/a> for an experimental justification.)<\/p>\n\n\n\n<p>We develop a theorem showing that, during the training process of a neural network under multi-view data, the network will:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\"><li>Quickly learn a <em>subset<\/em> of these view features depending on the randomness used in the learning process.<br><\/li><li>Memorize the small number of remaining data that cannot be classified correctly using these view features.<\/li><\/ol>\n\n\n\n<p>The first point implies that ensemble of different networks will collect all these learnable view features, hence achieving a higher test accuracy. The second point implies that individual models do not learn all the view features <em>not <\/em>because they do not have enough capacity, but rather because there are not sufficiently many training data left to learn these views. Most of the data has already been classified correctly with existing view features, so they essentially provide no gradient at this stage of training.<\/p>\n\n\n\n<h2 id=\"knowledge-distillation-forcing-an-individual-model-to-learn-multiple-views\">Knowledge distillation: Forcing an individual model to learn multiple views<\/h2>\n\n\n\n<p>In this new work, we continue to show how knowledge distillation works. In real-life scenarios, some car images might look \u201cmore like a cat\u201d compared to others: for example, the headlights on some car images might look like cat eyes. When this happens, the ensemble model can provide meaningful dark knowledge: for instance, \u201ccar image X is 10% like a cat.\u201d<\/p>\n\n\n\n<p>Now comes the key observation. When training an individual neural network model, if the \u201cheadlight\u201d view is not learned, then even if the remaining views can still correctly label image X as a car, they cannot be used to match the dark knowledge \u201cimage X is 10% like a cat.\u201d In other words, during knowledge distillation, the individual model is forced to learn every possible view feature, matching the performance of ensemble. Note that the crux of knowledge distillation in deep learning is that an individual model, as a neural network, is performing feature learning and therefore capable of learning all the features of ensemble. This is consistent with what we observe in practice. (See Figure 6.)<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"275\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/ensemble_figure6-1024x275.png\" alt=\"Performance if various single, ensemble, and knowledge distillation models on different versions of ResNet. The big takeaway from the numbers is that knowledge distillation on models after ensemble learning offers no additional performance boost. \" class=\"wp-image-717883\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/ensemble_figure6-1024x275.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/ensemble_figure6-300x81.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/ensemble_figure6-768x207.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/ensemble_figure6-16x4.png 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/ensemble_figure6.png 1428w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 6: Knowledge distillation has learned most of the view features from the ensemble, and so ensemble learning on models after knowledge distillation offers no more performance boost.<\/figcaption><\/figure>\n\n\n\n<h2 id=\"self-distillation-implicitly-combining-ensemble-and-knowledge-distillation\">Self-distillation: Implicitly combining ensemble and knowledge distillation<\/h2>\n\n\n\n<p>In this new work, we also give theoretical support to knowledge <em>self-distillation<\/em> (recall Figure 3). Training an individual model to match the output of another identical individual model (but using a different random seed) somehow gives a performance boost.<\/p>\n\n\n\n<p>At a high level, we view self-distillation as combining ensemble and knowledge distillation in a more compact manner. When learning an individual model \\(F_2\\) from random initialization to match the output of a separately trained individual model \\(F_1\\), one can expect \\(F_2\\) to learn a subset of the features depending on its own random initialization. In addition to this, \\(F_2\\) also has the incentive to learn the subset of features already learned by \\(F_1\\). In other words, one can view this process as \u201censemble learning two individual models \\(F_1\\),\\(F_2\\) and distilling it to \\(F_2\\).\u201d The final learned model \\(F_2\\) may not necessarily cover all the learnable views in the dataset, but it has the potential to at least learn all the views that can be covered through ensemble learning over two individual models. This is where the test-time performance boost comes from! (Recall Figures 2 and 3.)<\/p>\n\n\n\n<h2 id=\"conclusion-and-going-forward\">Conclusion and going forward<\/h2>\n\n\n\n<p>In this work, we show, to the best of our knowledge, the first theoretical proof toward understanding how ensemble works in deep learning. We also provide empirical evidence to support our theory and our \u201cmulti-view\u201d data hypothesis. We believe our framework can be applied to other settings. For example, data augmentation using random cropping could be potentially regarded as another way to enforce the network to learn \u201cmulti-views.\u201d We hope that, in practice, our new theoretical insights on how neural networks pick up features during training can also help design new, principled approaches to improve the test accuracy of a neural network, potentially matching that of ensemble.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Under now-standard techniques, such as over-parameterization, batch-normalization, and adding residual links, \u201cmodern age\u201d neural network training\u2014at least for image classification tasks and many others\u2014is usually quite stable. Using standard neural network architectures and training algorithms (typically SGD with momentum), the learned models perform consistently well, not only in terms of training accuracy but even in [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":718264,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-717850","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"guest","value":"yuanzhi-li","user_id":"670110","display_name":"Yuanzhi  Li","author_link":"<a href=\"https:\/\/www.andrew.cmu.edu\/user\/yuanzhil\/\" aria-label=\"Visit the profile page for Yuanzhi  Li\">Yuanzhi  Li<\/a>","is_active":true,"last_first":"Li, Yuanzhi ","people_section":0,"alias":"yuanzhi-li"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-960x540.jpg\" class=\"img-object-cover\" alt=\"diagram\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-1024x577.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-768x433.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-2048x1154.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-16x9.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_ensemble_no_logo_still-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Zeyuan Allen-Zhu and <a href=\"https:\/\/www.andrew.cmu.edu\/user\/yuanzhil\/\" title=\"Go to researcher profile for Yuanzhi  Li\" aria-label=\"Go to researcher profile for Yuanzhi  Li\" data-bi-type=\"byline author\" data-bi-cN=\"Yuanzhi  Li\">Yuanzhi  Li<\/a>","formattedDate":"January 19, 2021","formattedExcerpt":"Under now-standard techniques, such as over-parameterization, batch-normalization, and adding residual links, \u201cmodern age\u201d neural network training\u2014at least for image classification tasks and many others\u2014is usually quite stable. Using standard neural network architectures and training algorithms (typically SGD with momentum), the learned models perform consistently well,&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/717850","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=717850"}],"version-history":[{"count":55,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/717850\/revisions"}],"predecessor-version":[{"id":718294,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/717850\/revisions\/718294"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/718264"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=717850"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=717850"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=717850"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=717850"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=717850"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=717850"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=717850"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=717850"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=717850"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=717850"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=717850"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}