{"id":670482,"date":"2020-06-30T08:34:28","date_gmt":"2020-06-30T15:34:28","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=670482"},"modified":"2021-01-10T11:18:27","modified_gmt":"2021-01-10T19:18:27","slug":"backward-feature-correction-how-deep-learning-performs-deep-learning","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/backward-feature-correction-how-deep-learning-performs-deep-learning\/","title":{"rendered":"Backward Feature Correction: How Deep Learning Performs Deep Learning"},"content":{"rendered":"<p>How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and short training time? We present a theory towards explaining this in terms of hierarchical learning. We refer hierarchical learning as the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions to reduce sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and automatically by applying SGD.<\/p>\n<p>On the conceptual side, we present, to the best of our knowledge, the FIRST theory result indicating how deep neural networks can be sample and time efficient on certain hierarchical learning tasks, when NO KNOWN non-hierarchical algorithms (such as kernel method, linear regression over feature mappings, tensor decomposition, sparse coding, and their simple combinations) are efficient. We establish a principle called &#8220;backward feature correction&#8221;, where training higher layers in the network can improve the features of lower level ones. We believe this is the key to understand the deep learning process in multi-layer neural networks.<\/p>\n<p>On the technical side, we show for every input dimension <span id=\"MathJax-Element-1-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-1\" class=\"math\"><span id=\"MathJax-Span-2\" class=\"mrow\"><span id=\"MathJax-Span-3\" class=\"mi\">d<\/span><span id=\"MathJax-Span-4\" class=\"mo\">><\/span><span id=\"MathJax-Span-5\" class=\"mn\">0<\/span><\/span><\/span><\/span>, there is a concept class consisting of degree\u00a0<span id=\"MathJax-Element-2-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-6\" class=\"math\"><span id=\"MathJax-Span-7\" class=\"mrow\"><span id=\"MathJax-Span-8\" class=\"mi\">\u03c9<\/span><span id=\"MathJax-Span-9\" class=\"mo\">(<\/span><span id=\"MathJax-Span-10\" class=\"mn\">1<\/span><span id=\"MathJax-Span-11\" class=\"mo\">)<\/span><\/span><\/span><\/span>\u00a0multi-variate polynomials so that, using\u00a0<span id=\"MathJax-Element-3-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-12\" class=\"math\"><span id=\"MathJax-Span-13\" class=\"mrow\"><span id=\"MathJax-Span-14\" class=\"mi\">\u03c9<\/span><span id=\"MathJax-Span-15\" class=\"mo\">(<\/span><span id=\"MathJax-Span-16\" class=\"mn\">1<\/span><span id=\"MathJax-Span-17\" class=\"mo\">)<\/span><\/span><\/span><\/span>-layer neural networks as learners, SGD can learn any target function from this class in\u00a0<span id=\"MathJax-Element-4-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-18\" class=\"math\"><span id=\"MathJax-Span-19\" class=\"mrow\"><span id=\"MathJax-Span-20\" class=\"texatom\"><span id=\"MathJax-Span-21\" class=\"mrow\"><span id=\"MathJax-Span-22\" class=\"mi\">p<\/span><span id=\"MathJax-Span-23\" class=\"mi\">o<\/span><span id=\"MathJax-Span-24\" 
class=\"mi\">l<\/span><span id=\"MathJax-Span-25\" class=\"mi\">y<\/span><\/span><\/span><span id=\"MathJax-Span-26\" class=\"mo\">(<\/span><span id=\"MathJax-Span-27\" class=\"mi\">d<\/span><span id=\"MathJax-Span-28\" class=\"mo\">)<\/span><\/span><\/span><\/span>\u00a0time using\u00a0<span id=\"MathJax-Element-5-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-29\" class=\"math\"><span id=\"MathJax-Span-30\" class=\"mrow\"><span id=\"MathJax-Span-31\" class=\"texatom\"><span id=\"MathJax-Span-32\" class=\"mrow\"><span id=\"MathJax-Span-33\" class=\"mi\">p<\/span><span id=\"MathJax-Span-34\" class=\"mi\">o<\/span><span id=\"MathJax-Span-35\" class=\"mi\">l<\/span><span id=\"MathJax-Span-36\" class=\"mi\">y<\/span><\/span><\/span><span id=\"MathJax-Span-37\" class=\"mo\">(<\/span><span id=\"MathJax-Span-38\" class=\"mi\">d<\/span><span id=\"MathJax-Span-39\" class=\"mo\">)<\/span><\/span><\/span><\/span>\u00a0samples to any\u00a0<span id=\"MathJax-Element-6-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-40\" class=\"math\"><span id=\"MathJax-Span-41\" class=\"mrow\"><span id=\"MathJax-Span-42\" class=\"mfrac\"><span id=\"MathJax-Span-43\" class=\"mn\">1<\/span><span id=\"MathJax-Span-44\" class=\"mrow\"><span id=\"MathJax-Span-45\" class=\"texatom\"><span id=\"MathJax-Span-46\" class=\"mrow\"><span id=\"MathJax-Span-47\" class=\"mi\">p<\/span><span id=\"MathJax-Span-48\" class=\"mi\">o<\/span><span id=\"MathJax-Span-49\" class=\"mi\">l<\/span><span id=\"MathJax-Span-50\" class=\"mi\">y<\/span><\/span><\/span><span id=\"MathJax-Span-51\" class=\"mo\">(<\/span><span id=\"MathJax-Span-52\" class=\"mi\">d<\/span><span id=\"MathJax-Span-53\" class=\"mo\">)<\/span><\/span><\/span><\/span><\/span><\/span>\u00a0error, through learning to represent it as a composition of\u00a0<span id=\"MathJax-Element-7-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-54\" class=\"math\"><span id=\"MathJax-Span-55\" class=\"mrow\"><span id=\"MathJax-Span-56\" class=\"mi\">\u03c9<\/span><span id=\"MathJax-Span-57\" class=\"mo\">(<\/span><span id=\"MathJax-Span-58\" class=\"mn\">1<\/span><span id=\"MathJax-Span-59\" class=\"mo\">)<\/span><\/span><\/span><\/span>\u00a0layers of quadratic functions. 
In contrast, we present lower bounds stating that several non-hierarchical learners, including any kernel methods, neural tangent kernels, must suffer from\u00a0<span id=\"MathJax-Element-8-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-60\" class=\"math\"><span id=\"MathJax-Span-61\" class=\"mrow\"><span id=\"MathJax-Span-62\" class=\"msubsup\"><span id=\"MathJax-Span-63\" class=\"mi\">d<\/span><span id=\"MathJax-Span-64\" class=\"texatom\"><span id=\"MathJax-Span-65\" class=\"mrow\"><span id=\"MathJax-Span-66\" class=\"mi\">\u03c9<\/span><span id=\"MathJax-Span-67\" class=\"mo\">(<\/span><span id=\"MathJax-Span-68\" class=\"mn\">1<\/span><span id=\"MathJax-Span-69\" class=\"mo\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/span>\u00a0sample or time complexity to learn this concept class even to\u00a0<span id=\"MathJax-Element-9-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-70\" class=\"math\"><span id=\"MathJax-Span-71\" class=\"mrow\"><span id=\"MathJax-Span-72\" class=\"msubsup\"><span id=\"MathJax-Span-73\" class=\"mi\">d<\/span><span id=\"MathJax-Span-74\" class=\"texatom\"><span id=\"MathJax-Span-75\" class=\"mrow\"><span id=\"MathJax-Span-76\" class=\"mo\">\u2212<\/span><span id=\"MathJax-Span-77\" class=\"mn\">0.01<\/span><\/span><\/span><\/span><\/span><\/span><\/span>\u00a0error.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and short training time? We present a theory towards explaining this in terms of hierarchical learning. We refer hierarchical learning as the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions to reduce [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"ArXiv","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2020-6-1","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13561,13556,13546],"msr-publication-type":[193724],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["po
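To make the technical setup concrete, here is a minimal PyTorch sketch in the spirit of that result: a toy target built as a composition of quadratic functions over a d-dimensional input, fit by a multi-layer network with quadratic activations trained end-to-end by SGD. The dimensions, width, normalizations, and learning rate are illustrative choices of ours; the paper's actual concept class and learner network come with specific structural conditions that this toy omits.

```python
# A minimal sketch (NOT the paper's construction) of learning a hierarchical
# target -- a composition of quadratic functions -- with SGD.
import torch

torch.manual_seed(0)
d, L, m, n = 20, 3, 64, 4096          # input dim, depth, width, samples (toy sizes)

# Target: F = q_L o ... o q_1, each q_i a random quadratic map R^d -> R^d.
Ws = [torch.randn(d, d, d) / d for _ in range(L)]

def target(x):
    for W in Ws:
        x = torch.einsum('bi,ijk,bk->bj', x, W, x)   # coordinate-wise quadratic forms
        x = x / x.norm(dim=1, keepdim=True)          # keep scales bounded
    return x[:, :1]                                  # scalar label per sample

X = torch.randn(n, d)
y = target(X)

# Learner: a depth-L network whose layers square their pre-activations,
# mirroring the "omega(1) layers of quadratic functions" structure.
layers = torch.nn.ModuleList(
    [torch.nn.Linear(d, m)] + [torch.nn.Linear(m, m) for _ in range(L - 1)])
head = torch.nn.Linear(m, 1)

def net(x):
    for lin in layers:
        x = lin(x) ** 2                              # quadratic activation
        x = x / (1 + x.norm(dim=1, keepdim=True))    # normalization for stability
    return head(x)

params = list(layers.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = ((net(X) - y) ** 2).mean()
    loss.backward()                                  # all layers train jointly
    opt.step()
print(f'final train loss: {loss.item():.4f}')
```

The squared pre-activation mirrors the role of quadratic layers in the analysis; the poly(d) sample- and time-efficiency claims are theorems about the paper's concept class, not about this toy.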
st-670482","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-algorithms","msr-research-area-artificial-intelligence","msr-research-area-computational-sciences-mathematics","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2020-6-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"ArXiv","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/arxiv.org\/abs\/2001.04413","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[],"msr-author-ordering":[{"type":"user_nicename","value":"Zeyuan Allen-Zhu","user_id":36569,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Zeyuan Allen-Zhu"},{"type":"text","value":"Yuanzhi Li","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[199565],"msr_event":[],"msr_group":[],"msr_project":[392777],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"miscellaneous","related_content":{"projects":[{"ID":392777,"post_title":"Foundations of Optimization","post_name":"foundations-of-optimization","post_type":"msr-project","post_date":"2017-07-06 09:30:53","post_modified":"2018-12-04 14:12:39","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/foundations-of-optimization\/","post_excerpt":"Optimization methods are the engine of machine learning algorithms. Examples abound, such as training neural networks with stochastic gradient descent, segmenting images with submodular optimization, or efficiently searching a game tree with bandit algorithms. We aim to advance the mathematical foundations of both discrete and continuous optimization and to leverage these advances to develop new algorithms with a broad set of AI applications. 
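The backward feature correction principle can also be illustrated on the same toy problem: if each layer is trained greedily and then frozen, higher layers can never send corrections back to lower-level features, whereas end-to-end SGD backpropagates through all layers at every step. The comparison below continues the sketch above (reusing torch, d, L, m, X, and y) and is our own toy protocol, not an experiment from the paper.

```python
# Continues the sketch above (reuses torch, d, L, m, X, y).
# (a) Forward-only layerwise training: each layer is trained greedily and then
#     frozen, so higher layers can never correct lower-level features.
# (b) End-to-end SGD: every step backpropagates through all layers, allowing
#     backward feature correction.
def make_net(seed):
    torch.manual_seed(seed)
    ls = torch.nn.ModuleList(
        [torch.nn.Linear(d, m)] + [torch.nn.Linear(m, m) for _ in range(L - 1)])
    return ls, torch.nn.Linear(m, 1)

def forward(ls, hd, x, depth):
    for lin in ls[:depth]:
        x = lin(x) ** 2                              # quadratic activation
        x = x / (1 + x.norm(dim=1, keepdim=True))
    return hd(x)

def fit(params, fwd, steps=1000, lr=1e-3):
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((fwd() - y) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()

ls, hd = make_net(1)
for k in range(1, L + 1):                            # (a) layerwise, lower layers frozen
    trainable = list(ls[k - 1].parameters()) + list(hd.parameters())
    layerwise_loss = fit(trainable, lambda: forward(ls, hd, X, k))

ls2, hd2 = make_net(1)                               # (b) end-to-end
joint_loss = fit(list(ls2.parameters()) + list(hd2.parameters()),
                 lambda: forward(ls2, hd2, X, L))
print(f'layerwise (frozen): {layerwise_loss:.4f}   end-to-end: {joint_loss:.4f}')
```

Variant (a) corresponds to purely forward, layerwise learning; variant (b) is the regime the paper analyzes, in line with its message that corrections flowing backward from higher layers are what improve the lower-level features.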