{"id":894768,"date":"2022-11-01T19:05:34","date_gmt":"2022-11-02T02:05:34","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=894768"},"modified":"2022-11-04T21:32:57","modified_gmt":"2022-11-05T04:32:57","slug":"focalnets-focusing-the-eyes-with-focal-modulation","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/focalnets-focusing-the-eyes-with-focal-modulation\/","title":{"rendered":"FocalNets: Focus Eyes with Focal Modulation"},"content":{"rendered":"\n<p>Human eyes have a dynamic focusing system that adjusts the focal regions in order to see the surroundings at all distances. When we look far away, up close, and back again, our eyes change focus rapidly to allow us to perceive things finely and coarsely. In computer vision (CV), It remains an open question how to build a neural network that can mimic this behavior and feasibly focus on various granularities of visual inputs towards different tasks.<\/p>\n\n\n\n<p>In the past few years, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\" rel=\"noopener noreferrer\">Transformers<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/2010.11929.pdf\">Vision Transformers<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> have led to unprecedented AI breakthroughs in NLP and vision, respectively. For vision particularly, what makes the Transformers stand out is arguably the self-attention (SA) mechanism, which enables each query token to adaptively gather information from others.  
SA learns dependencies across different visual tokens, which yields better generalization than the canonical convolution layer with its static kernels. In the visual world, the input signal is often continuous and comes with an arbitrary granularity and scope. Nevertheless, SA is typically used to model a fixed set of predetermined tokens at a specific scope and granularity, and the interactions among individual tokens are usually dense and computationally heavy, which limits its usability in understanding the complicated visual world.<\/p>\n\n\n\n<p>In this blog, we introduce our recent work on building neural networks with focal modulation, leading to the new architecture family:<strong> <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2203.11926\">FocalNets<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/strong>. The highlights include: <\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><em>FocalNet achieves a new <strong>state-of-the-art<\/strong> (<strong>SoTA<\/strong>) on the most challenging vision task: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/paperswithcode.com\/sota\/object-detection-on-coco\" target=\"_blank\" rel=\"noopener noreferrer\">COCO object detection<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/em>, with a 3x smaller model size and training data size. &nbsp;This marks a milestone: it is the first attention-free model in the past two years to surpass all Transformer models on the leaderboard.<\/li><li><em>FocalNet exhibits an intriguing interpretable<\/em> <em>learning behavior<\/em>. It can discover and segment objects in an image or a video, which Transformers can hardly do. 
As the following example shows, the modulation focus maps gradually shift from the early through the middle to the final stage of perception and are intuitively interpretable. This suggests that FocalNet is capable of different levels of image understanding.<\/li><\/ul>\n\n\n\n<p class=\"has-text-align-center\"><img loading=\"lazy\" decoding=\"async\" width=\"2488\" height=\"1763\" class=\"wp-image-894780\" style=\"width: 300px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/coco_sota_cmp.png\" alt=\"SoTA comparisons on COCO object detection\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/coco_sota_cmp.png 2488w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/coco_sota_cmp-300x213.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/coco_sota_cmp-1024x726.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/coco_sota_cmp-768x544.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/coco_sota_cmp-1536x1088.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/coco_sota_cmp-2048x1451.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/coco_sota_cmp-240x170.png 240w\" sizes=\"auto, (max-width: 2488px) 100vw, 2488px\" \/>    <img loading=\"lazy\" decoding=\"async\" width=\"914\" height=\"480\" class=\"wp-image-895209\" style=\"width: 400px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Blog_teaser.gif\" alt=\"modulation_focus\"><br><em><sub>(Left) Comparison with SoTA on COCO object detection. Circle size indicates the model size. 
(Right) Modulation focus maps at the early, middle, and final stages of visual perception with our FocalNet<\/sub><\/em><\/p>\n\n\n\n<p>We also released the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2203.11926\" target=\"_blank\" rel=\"noopener noreferrer\">paper on arXiv<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,  <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/FocalNet\">PyTorch codebase on the project GitHub page<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/huggingface.co\/spaces\/jw2yang\/focalnet-modulators\">HuggingFace demo<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Feel free to give it a try.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-16018d1d wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-fill-download\"><a data-bi-type=\"button\" class=\"wp-block-button__link\" href=\"https:\/\/arxiv.org\/abs\/2203.11926\">ArXiv Paper<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-fill\"><a data-bi-type=\"button\" class=\"wp-block-button__link\" href=\"https:\/\/huggingface.co\/spaces\/jw2yang\/focalnet-modulators\">HuggingFace Demo<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-fill-github\"><a data-bi-type=\"button\" class=\"wp-block-button__link\" href=\"https:\/\/github.com\/microsoft\/FocalNet\">GitHub Code<\/a><\/div>\n<\/div>\n\n\n\n<p><\/p>\n\n\n\n<h3 id=\"eye-focusing-with-focal-modulation-networks\">Eye focusing with Focal Modulation Networks<\/h3>\n\n\n\n<p>At the core of Focal Modulation 
Networks (FocalNets) is the <em>focal modulation<\/em> mechanism: a lightweight element-wise multiplication that serves as the focusing operator, allowing the model to see and interact with the input through the proposed <em>modulator<\/em>. As depicted below, the modulator is computed with a focal aggregation procedure in two steps: <em>focal contextualization<\/em> extracts contexts from local to global ranges at different levels of granularity, and <em>gated aggregation<\/em> condenses all context features at the different granularity levels into the modulator.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Fig3-FocalNet.gif\" alt=\"The illustration of focal modulation process and the constructed FocalNet\" class=\"wp-image-894879\" width=\"671\" height=\"360\" \/><figcaption><em>The illustration of the focal modulation process and the constructed FocalNet<\/em><\/figcaption><\/figure>\n\n\n\n<h3 id=\"focal-modulation-vs-self-attention\">Focal Modulation <em>vs<\/em> Self-Attention<\/h3>\n\n\n\n<p><em>Similar goals, but different focusing processes<\/em>. Focal modulation and self-attention are two different ways to enable AI models to selectively focus on certain parts of their input. Self-attention starts with interaction and then aggregation, while focal modulation starts with aggregation and then interaction, which significantly eases the process with much lighter-weight operations. 
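To make the two-step procedure concrete, here is a minimal PyTorch sketch of focal modulation. The layer names and hyper-parameters (dim, focal_levels, kernel_size) are illustrative assumptions for exposition, not the exact configuration; the official FocalNet codebase on GitHub is the authoritative implementation.

```python
# A minimal sketch of focal modulation: hierarchical depth-wise
# convolutions for focal contextualization, learned gates for
# aggregation, and an element-wise product for modulation.
import torch
import torch.nn as nn


class FocalModulation(nn.Module):
    def __init__(self, dim, focal_levels=3, kernel_size=3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection produces the query, the initial context,
        # and one gate per focal level plus one for the global level.
        self.f = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # Focal contextualization: depth-wise convolutions with growing
        # kernels extract contexts from local to global ranges.
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=kernel_size + 2 * k,
                          padding=(kernel_size + 2 * k) // 2, groups=dim),
                nn.GELU(),
            )
            for k in range(focal_levels)
        )
        self.h = nn.Conv2d(dim, dim, kernel_size=1)  # modulator projection
        self.proj = nn.Linear(dim, dim)              # output projection

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(
            self.f(x), (C, C, self.focal_levels + 1), dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)      # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)  # (B, levels + 1, H, W)
        # Gated aggregation: condense all context levels into one map.
        ctx_all = 0.0
        for k, layer in enumerate(self.layers):
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, k:k + 1]
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)  # global context
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]
        # Element-wise modulation: the query interacts with the modulator.
        modulator = self.h(ctx_all).permute(0, 2, 3, 1)  # (B, H, W, C)
        return self.proj(q * modulator)


x = torch.randn(2, 14, 14, 96)
out = FocalModulation(dim=96)(x)
print(out.shape)  # torch.Size([2, 14, 14, 96])
```

Note that all token-to-token interaction is replaced by a single element-wise product per token, which is why the cost stays linear in the number of tokens rather than quadratic as in self-attention.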
<\/p>\n\n\n\n<p class=\"has-text-align-center\"><br><img loading=\"lazy\" decoding=\"async\" width=\"886\" height=\"480\" class=\"wp-image-894891\" style=\"width: 350px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Fig1-FM-Animation.gif\" alt=\"a close up of a dog\">         vs        <img loading=\"lazy\" decoding=\"async\" width=\"886\" height=\"480\" class=\"wp-image-895023\" style=\"width: 350px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Fig1-SA-Animation-63620493ea519.gif\" alt=\"SA-Animation\"><\/p>\n\n\n\n<p><em>Modulation Map vs Attention Map.<\/em> Both methods learn to focus, but the selected focus areas are quite different. With the standard supervised training of FocalNet and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2010.11929\" target=\"_blank\" rel=\"noopener noreferrer\">Vision Transformers (ViT)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on ImageNet, we visualize the modulation map of FocalNet and the attention map of ViT, respectively. We observe that our focal modulation automatically learns an interpretable representation and separates the main object from the background clutter. It learns to segment objects without any form of dedicated dense pixel-level supervision, and the selected focus areas are coherent with the human-generated annotation in the image classification task. 
In contrast, the selected focus areas of attention maps in ViT are less meaningful and may highlight some spuriously correlated regions.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Fig-Attention_vs_Modulation-6362e163efa4d-1024x528.png\" alt=\"modulation map and attention map\" class=\"wp-image-895167\" width=\"777\" height=\"401\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Fig-Attention_vs_Modulation-6362e163efa4d-1024x528.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Fig-Attention_vs_Modulation-6362e163efa4d-300x155.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Fig-Attention_vs_Modulation-6362e163efa4d-768x396.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Fig-Attention_vs_Modulation-6362e163efa4d-1536x791.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Fig-Attention_vs_Modulation-6362e163efa4d-2048x1055.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/Fig-Attention_vs_Modulation-6362e163efa4d-240x124.png 240w\" sizes=\"auto, (max-width: 777px) 100vw, 777px\" \/><figcaption>From top to bottom:<em> Original image, Modulation map, and Attention Map<\/em> (Images are from ImageNet-1K validation set)<\/figcaption><\/figure>\n\n\n\n<p><br>When visualizing the modulation maps in the network for videos, we see that they correspond to coherent semantic regions of the moving objects.<\/p>\n\n\n\n<p class=\"has-text-align-center\"> <img loading=\"lazy\" decoding=\"async\" width=\"881\" height=\"434\" class=\"wp-image-895062\" style=\"width: 350px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/mallard-fly.gif\" alt=\"mallard-fly\">          
  <img loading=\"lazy\" decoding=\"async\" width=\"881\" height=\"434\" class=\"wp-image-894927\" style=\"width: 350px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/goat.gif\" alt=\"modulation map of a goat\"><\/p>\n\n\n\n<p class=\"has-text-align-center\"> <img loading=\"lazy\" decoding=\"async\" width=\"881\" height=\"434\" class=\"wp-image-894999\" style=\"width: 350px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/surf.gif\" alt=\"surf_focalnet\">            <img loading=\"lazy\" decoding=\"async\" width=\"881\" height=\"434\" class=\"wp-image-894990\" style=\"width: 350px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/paragliding.gif\" alt=\"paragliding_focalnet\"><\/p>\n\n\n\n<blockquote class=\"wp-block-quote has-text-align-center is-layout-flow wp-block-quote-is-layout-flow\"><p>I am excited about our new way of enabling AI to focus on the right parts of the input through focal modulation.<\/p><cite><em>\u2014 <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/johannes\/\" target=\"_blank\" rel=\"noreferrer noopener\">Johannes Gehrke<\/a>, Technical Fellow, Lab Director of Research at Redmond, and CTO and Head of Machine Learning for the Intelligent Communications and Conversations Cloud (IC3)<\/em><\/cite><\/blockquote>\n\n\n\n<h3 id=\"dense-prediction-tasks-with-high-resolution-images\">Dense Prediction Tasks with High-Resolution Images<\/h3>\n\n\n\n<p>FocalNet is compared against established vision backbone networks, including <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2010.11929\" target=\"_blank\" rel=\"noopener noreferrer\">Vision Transformers (ViT)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2103.14030\" 
target=\"_blank\" rel=\"noopener noreferrer\">Swin Transformers<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2201.03545\" target=\"_blank\" rel=\"noopener noreferrer\">ConvNeXt<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on different CV tasks, including <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/paperswithcode.com\/sota\/image-classification-on-imagenet\" target=\"_blank\" rel=\"noopener noreferrer\">ImageNet classification<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/eval.ai\/web\/challenges\/challenge-page\/1832\/leaderboard\/4298\" target=\"_blank\" rel=\"noopener noreferrer\">zero-shot classification on 20 datasets on ICinW<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and dense prediction tasks such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/paperswithcode.com\/sota\/object-detection-on-coco\" target=\"_blank\" rel=\"noopener noreferrer\">object detection<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/paperswithcode.com\/sota\/semantic-segmentation-on-ade20k\" target=\"_blank\" rel=\"noopener noreferrer\">segmentation<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. FocalNet consistently outperforms others. The attention-free design of focal modulation can particularly benefit the dense visual prediction tasks with a high-resolution image input, as it allows the model to see a wider scope at different granularities and avoid the heavy burden of token-to-token interaction. 
Importantly, it achieves a new SoTA of <strong>64.3<\/strong> (test-dev) \/ <strong>64.2<\/strong> (minival) on COCO object detection, outperforming the prior art SwinV2-Giant and BEiT-3 models with a 3x smaller model\/data size.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"331\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/fig_net_cmp-6365e72165698-1024x331.png\" alt=\"FocalNet consistently shows superior performance on a wide set of computer vision problems\" class=\"wp-image-896055\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/fig_net_cmp-6365e72165698-1024x331.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/fig_net_cmp-6365e72165698-300x97.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/fig_net_cmp-6365e72165698-768x248.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/fig_net_cmp-6365e72165698-1536x497.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/fig_net_cmp-6365e72165698-2048x662.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/fig_net_cmp-6365e72165698-240x78.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption><em>FocalNet consistently shows superior performance on a wide set of computer vision problems<\/em><\/figcaption><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote has-text-align-center is-layout-flow wp-block-quote-is-layout-flow\"><p>Glad to continue to push on this state-of-the-art computer vision innovation to delight our worldwide Azure Cognitive Services customers.<\/p><cite><em>\u2014<\/em> <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xdh\/\" target=\"_blank\" rel=\"noreferrer noopener\">Xuedong Huang<\/a>, Microsoft Technical Fellow and Chief 
Technology Officer of <em>Azure AI<\/em><\/cite><\/blockquote>\n\n\n\n<h3 id=\"from-the-broader-view-of-cognitive-and-neuroscience\">From the Broader View of Cognitive and Neuroscience <\/h3>\n\n\n\n<p><em>FocalNets mimic human vision<\/em>. In humans, attention is critical to our ability to focus on specific aspects of environmental stimuli while filtering out other irrelevant information. By definition, visual attention plays a key role in isolating the foreground from the background. Not surprisingly, an algorithm mimicking attention is critical for object recognition in computer vision. Visual attention can be roughly classified into two large categories: feature attention vs spatial attention (<em>e.g.<\/em> <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S0896627305006148\" target=\"_blank\" rel=\"noopener noreferrer\">Hayden and Gallant, Neuron 2005<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>; <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S0896627315008673\">Bichot et al, Neuron 2015<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>). Spatial attention directs the movement of eyes to specific locations and therefore is closely linked to the gaze control system. The existing Self-attention (SA) network appears more in line with the spatial attention mechanism of the brain. However, in many cases, we do not know where the object is located or where to focus, but we know it has distinct features. Feature-based attention therefore operates across the visual field and is not closely connected to the eye movement system. Its goal is to construct and maintain an internal representation of the target. 
Furthermore, in natural human vision, spatial attention and feature attention work together. Importantly, while most studies of visual attention focus on the cortex, it is also well-recognized that the pulvinar nucleus of the thalamus interacts with the cortex and plays a critical role in selective attention. Patients with lesions of the pulvinar nucleus have difficulties in filtering out distractors during attention tasks (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.pnas.org\/doi\/abs\/10.1073\/pnas.0810086106\" target=\"_blank\" rel=\"noopener noreferrer\">Snow JC, et al PNAS2009<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>).<\/p>\n\n\n\n<blockquote class=\"wp-block-quote has-text-align-center is-layout-flow wp-block-quote-is-layout-flow\"><p>The new algorithm FocalNet appears to better mimic the feature attention system, and hence it performs better in segmenting object from background. This superb ability of FocalNet could be mimicking the dynamic interactions between pulvinar and cortex<\/p><cite>&#8212; <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/mcgovern.mit.edu\/profile\/fan-wang\/\" target=\"_blank\" rel=\"noopener noreferrer\">Fan Wang<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, Professor of Brain and Cognitive Sciences, Massachusetts Institute of Technology<\/cite><\/blockquote>\n\n\n\n<p><em>Focal modulation shares some similar structures with <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Interneuron\" target=\"_blank\" rel=\"noopener noreferrer\">interneurons <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>in neural system<\/em>. 
(1) One example is the spinal cord: painful information is transmitted to the spinal cord, but projection neurons are only a minority there; most neurons in the dorsal horn are interneurons that process and integrate information and control whether or not painful information is transmitted to higher centers. (2) In motor control, there&#8217;s the top-down command and there&#8217;s the final motor neuron output, but for efficient motor control there are also &#8220;modules&#8221; formed by premotor interneurons that can generate stereotypical patterns such as rhythms and sequences. It makes sense for interneuron &#8220;modules&#8221; to specialize in certain processes, so that top-down control can simply orchestrate these modules. (3) In the somatosensory (body sensory) system, while itch and pain are two distinct sensations, the peripheral sensory neurons that detect &#8220;itchy&#8221; or &#8220;painful&#8221; stimuli are not so distinct: many of these sensory neurons express &#8220;sensors&#8221; (receptors) for both itch-inducing and pain-inducing stimuli. The interneurons in the spinal cord play a key role in processing the &#8220;ambiguous&#8221; incoming information and separating it into the subsequent &#8220;itch&#8221; vs &#8220;pain&#8221; pathways.<\/p>\n\n\n\n<h3 id=\"a-new-building-block-for-the-next-generation-ai-models\">A new building block for the next-generation AI models<\/h3>\n\n\n\n<p>With FocalNets, the AI research community can build new computer vision systems for high-resolution visual inputs more efficiently. 
We hope that our experiments will show the community the potential of FocalNets and encourage further adoption of focal modulation.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><em>Acknowledgment: This research was conducted by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/jwyang.github.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Jianwei Yang<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/chunyuan.li\/\" target=\"_blank\" rel=\"noopener noreferrer\">Chunyuan Li<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/scholar.google.com\/citations?user=QC8RwcoAAAAJ\" target=\"_blank\" rel=\"noopener noreferrer\">Xiyang Dai<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/scholar.google.com\/citations?user=k9TsUVsAAAAJ\" target=\"_blank\" rel=\"noopener noreferrer\">Lu Yuan<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\" target=\"_blank\" rel=\"noreferrer noopener\"> Jianfeng Gao<\/a>. 
The connections to human vision and neuroscience are drawn by <em><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/mcgovern.mit.edu\/profile\/fan-wang\/\">Fan Wang<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/em> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/scholar.google.com\/citations?user=jRodyzgAAAAJ\" target=\"_blank\" rel=\"noopener noreferrer\">Jinghao Lu<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> from MIT. Additional thanks go to the Microsoft Research Horizontal AI Team and the Microsoft Alexander Multi-modal team for providing computing resources for large-scale training. We would like to thank the DINO team from IDEA, including <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.leizhang.org\/\">Lei Zhang<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/scholar.google.com\/citations?user=B8hPxMQAAAAJ&hl=en\">Hao Zhang<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/fengli-ust.github.io\/\">Feng Li<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.lsl.zone\/\">Shilong Liu<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, for helpful discussions and detailed instructions on using DINO for object detection. 
We would like to thank <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/ashkamath.github.io\/\">Aishwarya Kamath<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> from NYU for sharing the Object365v2 dataset. We would like to thank Lingchen Meng for helping convert contrastive denoising into regular denoising in DINO.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Human eyes have a dynamic focusing system that adjusts the focal regions in order to see the surroundings at all distances. When we look far away, up close, and back again, our eyes change focus rapidly to allow us to perceive things finely and coarsely. In computer vision (CV), It remains an open question how [&hellip;]<\/p>\n","protected":false},"author":37971,"featured_media":894780,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":144931,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-894768","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":144931,"type":"group"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/894768","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37971"}],"version-history":[{"count":32,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/894768\/revisions"}],"predecessor-version":[{"
id":896061,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/894768\/revisions\/896061"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/894780"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=894768"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=894768"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=894768"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=894768"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}