{"id":847438,"date":"2022-05-24T20:49:13","date_gmt":"2022-05-25T03:49:13","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/"},"modified":"2022-11-04T21:43:36","modified_gmt":"2022-11-05T04:43:36","slug":"focal-modulation-networks","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/focal-modulation-networks\/","title":{"rendered":"FocalNet: Focal Modulation Networks"},"content":{"rendered":"<p>In this work, we propose focal modulation network (FocalNet in short), where self-attention (SA) is completely replaced by a focal modulation module that is more effective and efficient for modeling token interactions. Focal modulation comprises three components:\u00a0<span id=\"MathJax-Element-1-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-1\" class=\"math\"><span id=\"MathJax-Span-2\" class=\"mrow\"><span id=\"MathJax-Span-3\" class=\"mo\">(<\/span><span id=\"MathJax-Span-4\" class=\"mi\">i<\/span><span id=\"MathJax-Span-5\" class=\"mo\">)<\/span><\/span><\/span><\/span>\u00a0hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges at different granularity levels,\u00a0<span id=\"MathJax-Element-2-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-6\" class=\"math\"><span id=\"MathJax-Span-7\" class=\"mrow\"><span id=\"MathJax-Span-8\" class=\"mo\">(<\/span><span id=\"MathJax-Span-9\" class=\"mi\">i<\/span><span id=\"MathJax-Span-10\" class=\"mi\">i<\/span><span id=\"MathJax-Span-11\" class=\"mo\">)<\/span><\/span><\/span><\/span>\u00a0gated aggregation to selectively aggregate context features for each visual token (query) based on its content, and\u00a0<span id=\"MathJax-Element-3-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-12\" class=\"math\"><span id=\"MathJax-Span-13\" class=\"mrow\"><span id=\"MathJax-Span-14\" class=\"mo\">(<\/span><span id=\"MathJax-Span-15\" class=\"mi\">i<\/span><span id=\"MathJax-Span-16\" class=\"mi\">i<\/span><span id=\"MathJax-Span-17\" class=\"mi\">i<\/span><span id=\"MathJax-Span-18\" class=\"mo\">)<\/span><\/span><\/span><\/span>\u00a0modulation or element-wise affine transformation to fuse the aggregated features into the query vector. Extensive experiments show that FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin Transformers) with similar time and memory cost on the tasks of image classification, object detection, and semantic segmentation. Specifically, our FocalNets with tiny and base sizes achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretrained on ImageNet-22K, it attains 86.5% and 87.3% top-1 accuracy when finetuned with resolution 224<span id=\"MathJax-Element-4-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-19\" class=\"math\"><span id=\"MathJax-Span-20\" class=\"mrow\"><span id=\"MathJax-Span-21\" class=\"mo\">\u00d7<\/span><\/span><\/span><\/span>224 and 384<span id=\"MathJax-Element-5-Frame\" class=\"MathJax\"><span id=\"MathJax-Span-22\" class=\"math\"><span id=\"MathJax-Span-23\" class=\"mrow\"><span id=\"MathJax-Span-24\" class=\"mo\">\u00d7<\/span><\/span><\/span><\/span>384, respectively. FocalNets exhibit remarkable superiority when transferred to downstream tasks. 
For object detection with Mask R-CNN, our FocalNet base trained with a 1× schedule already surpasses Swin trained with a 3× schedule (49.0 vs. 48.5 box mAP on COCO). For semantic segmentation with UperNet, FocalNet base evaluated at single scale outperforms Swin evaluated at multi-scale (50.5 vs. 49.7 mIoU on ADE20K). These results render focal modulation a favorable alternative to SA for effective and efficient visual modeling in real-world applications. Code is available at https://github.com/microsoft/FocalNet.
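To make the three components concrete, here is a minimal PyTorch sketch of a focal modulation block. It is an illustrative simplification under our own assumptions, not the authors' implementation (which lives in the repository linked above); the class and parameter names (FocalModulation, focal_levels, proj_in, and so on) are hypothetical, and details such as normalization and dropout are omitted.

```python
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    """Simplified focal modulation block (illustrative, not the official code)."""

    def __init__(self, dim, focal_levels=3, kernel_size=3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection yields the query, the initial context features,
        # and one gate per focal level (plus one for the global level).
        self.proj_in = nn.Linear(dim, 2 * dim + (focal_levels + 1))
        # (i) Hierarchical contextualization: a stack of depth-wise convs
        # with growing kernels, covering short- to long-range context.
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=kernel_size + 2 * k,
                          padding=(kernel_size + 2 * k) // 2, groups=dim),
                nn.GELU(),
            )
            for k in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, kernel_size=1)  # context projection
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) grid of visual tokens.
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(
            self.proj_in(x), [C, C, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)      # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)  # (B, L+1, H, W)

        # (ii) Gated aggregation: sum the per-level contexts, each weighted
        # by a content-dependent gate for every token.
        ctx_all = 0
        for level, layer in enumerate(self.layers):
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]
        # Global average pooling serves as the coarsest (global) level.
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]

        # (iii) Modulation: fuse the aggregated context into each query
        # via an element-wise product.
        modulator = self.h(ctx_all).permute(0, 2, 3, 1)  # (B, H, W, C)
        return self.proj_out(q * modulator)

# Example: modulate a 14x14 grid of 96-dim tokens.
x = torch.randn(2, 14, 14, 96)
y = FocalModulation(dim=96)(x)  # -> torch.Size([2, 14, 14, 96])
```

In a full network, a block like this would take the place of the self-attention layer inside each Transformer-style stage; note that, unlike SA, its cost grows linearly with the number of tokens since all interactions go through the convolutional context rather than pairwise attention.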
Gao","user_id":"32246"}],"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"Microsoft","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2022-3-22","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13556,13562],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[259048],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-847438","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2022-3-22","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"Microsoft","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/arxiv.org\/abs\/2203.11926","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[],"msr-author-ordering":[{"type":"user_nicename","value":"Jianwei Yang","user_id":40261,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Jianwei Yang"},{"type":"user_nicename","value":"Chunyuan Li","user_id":37971,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Chunyuan Li"},{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Jianfeng 
Gao"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[873195],"msr_group":[144931],"msr_project":[],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/847438","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":2,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/847438\/revisions"}],"predecessor-version":[{"id":847606,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/847438\/revisions\/847606"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=847438"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=847438"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=847438"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=847438"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=847438"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=847438"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=847438"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=847438"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=847438"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=847438"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=847438"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=847438"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=847438"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}