{"id":948273,"date":"2023-06-20T09:19:50","date_gmt":"2023-06-20T16:19:50","guid":{"rendered":""},"modified":"2023-06-21T06:41:27","modified_gmt":"2023-06-21T13:41:27","slug":"microsoft-at-cvpr-2023-pushing-the-boundaries-of-computer-vision","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-at-cvpr-2023-pushing-the-boundaries-of-computer-vision\/","title":{"rendered":"Microsoft at CVPR 2023: Pushing the boundaries of computer vision"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"264\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/BlogBanner-1400x264_AH.jpg\" alt=\"Logo for the CVPR 2023 conference showing the Vancouver, British Columbia skyline with the conference dates, June 18\u201323, 2023. In the background, there is a faded photo of the city of Vancouver on a sunny day.\" class=\"wp-image-948327\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/BlogBanner-1400x264_AH.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/BlogBanner-1400x264_AH-300x57.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/BlogBanner-1400x264_AH-1024x193.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/BlogBanner-1400x264_AH-768x145.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/BlogBanner-1400x264_AH-240x45.jpg 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<p>In the vast realm of artificial intelligence, few fields have captivated our imagination and pushed the boundaries of possibility quite like <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/research-area\/computer-vision\/?\">computer vision<\/a>. At the core of this domain of research and innovation lies the ambition to empower technologies for real-world vision-based systems, enabling machines to take in and respond to visual stimuli with unparalleled precision and sophistication. Through the combination of AI, deep learning, and vast amounts of data, computer vision has made great strides in recent years, catapulting us into an era in which the seemingly impossible becomes achievable.<\/p>\n\n\n\n<p>The <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/cvpr2023.thecvf.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Computer Vision and Pattern Recognition<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;(CVPR) 2023, held June 10 through June 22, is a widely recognized event that brings together leading experts in the field of computer vision. It serves as a platform for showcasing some of the most compelling and innovative work in this domain.&nbsp;<\/p>\n\n\n\n<p>The contributions presented by Microsoft researchers and their collaborators at this year&#8217;s CVPR cover a wide spectrum of research endeavors. 
From generative models and network pretraining to sign language understanding and neural video codecs, these cutting-edge advances underscore the evolving capability of systems to analyze and extract valuable insights from visual data.

Here are some of the highlights (see below for a list of published papers and their authors):

## Uniting vision, language, and multi-modal encoding

The paper "[Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks](https://www.microsoft.com/en-us/research/publication/image-as-a-foreign-language-beit-pretraining-for-vision-and-vision-language-tasks/)" lies at the intersection of vision, language, and multimodal pretraining. To learn from these different forms of data, we present a general-purpose foundation model that treats images as a "foreign language." Data from the different modalities are encoded with Multiway Transformers, a modular architecture that enables both modality-specific encoding and deep cross-modal fusion. The model is pretrained on images, text, and image-text pairs in a way that generalizes the masked language modeling approach to other modalities. By substantially scaling up both the model and the data, we found that these advances in foundation architecture and pretraining lead to excellent transfer performance on a variety of vision and vision-language tasks, including object detection, semantic segmentation, image classification, visual reasoning, visual question answering, image captioning, and cross-modal image retrieval.
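To make the Multiway Transformer idea concrete, here is a minimal sketch in PyTorch: a block whose self-attention is shared across modalities while each token is routed to a modality-specific feed-forward "expert." This is an illustrative reconstruction under our own assumptions (the names `MultiwayBlock` and `modality_ids` and the two-expert setup are hypothetical), not the BEiT-3 implementation.

```python
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """Illustrative Multiway Transformer block: shared attention,
    per-modality feed-forward experts (e.g., vision vs. language)."""

    def __init__(self, dim: int = 768, heads: int = 12, num_modalities: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward "expert" per modality; the attention is shared.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_modalities)
        ])

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_ids: (batch, seq), values in [0, num_modalities)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # tokens of all modalities attend jointly
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m      # route each token to its modality's expert
            out[mask] = expert(h[mask])
        return x + out

# Toy usage: 4 image tokens followed by 4 text tokens in one fused sequence.
block = MultiwayBlock()
tokens = torch.randn(1, 8, 768)
modality_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
print(block(tokens, modality_ids).shape)  # torch.Size([1, 8, 768])
```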
## Scaling training data for large vision models

The strength of large language models stems from their ability to leverage unlabeled training data at massive scale. By using this data, these models acquire a broad understanding of language, enhance their generalization abilities, and improve their performance across a wide range of language-related tasks. Inspired by this achievement, our research explores the possibilities of scaling training data for large vision models. In the paper "[On Data Scaling in Masked Image Modeling](https://www.microsoft.com/en-us/research/publication/on-data-scaling-in-masked-image-modeling/)," we study the effects of data scaling on large vision models pretrained through masked image modeling. Through extensive investigation, we found that masked image modeling requires large-scale data to pretrain large vision models effectively. However, unlike large language models, large vision models cannot benefit from more data in a non-overfitting scenario. These findings deepen our understanding of masked image modeling and may pave the way for future advances in large-scale vision models.
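For readers unfamiliar with the pretraining task being scaled here, the sketch below shows the essence of masked image modeling: hide a large random fraction of image patches and train the network to reconstruct what was hidden. It is a simplified illustration assuming a generic `encoder` that maps patch sequences to patch predictions; the 16-pixel patches and 75% mask ratio are common choices in the literature, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def mim_loss(encoder, images, patch_size=16, mask_ratio=0.75):
    """One masked-image-modeling step: mask random patches, predict their pixels."""
    B, C, H, W = images.shape
    # Cut the image into non-overlapping patches: (B, num_patches, C * p * p).
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
    num_patches = patches.shape[1]

    # Choose a random subset of patches to mask for each image.
    mask = torch.rand(B, num_patches, device=images.device) < mask_ratio

    # Zero out masked patches at the input; the encoder must reconstruct them.
    visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = encoder(visible)  # assumed to return (B, num_patches, C * p * p)

    # The loss is computed only on the masked positions.
    return F.mse_loss(pred[mask], patches[mask])
```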
## Creating 3D avatars with a diffusion network

In the world of image generation, incredible strides have been made in transforming text descriptions into stunning visuals, and the rise of DALL-E and diffusion models has put these cutting-edge tools into the hands of everyday users. In the paper "[RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion](https://www.microsoft.com/en-us/research/publication/rodin-a-generative-model-for-sculpting-3d-digital-avatars-using-diffusion/)," we build on this innovation by bringing the power of diffusion to 3D avatar generation. Transferring diffusion from 2D to 3D is a significant challenge, however, because of the prohibitive memory and processing costs of producing high-quality results with rich detail in 3D. We overcome this problem with the roll-out diffusion network (RODIN), which unrolls a 3D neural radiance field into a single 2D feature plane and performs 3D-aware diffusion on it. Supported by other technical contributions, including latent conditioning to promote global coherence and hierarchical synthesis to further enhance details, RODIN significantly accelerates the otherwise tedious 3D modeling process and opens new opportunities for 3D artists.
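A minimal sketch of the unrolling step described above, assuming the common tri-plane layout for the 3D representation: the three feature planes are laid side by side as one 2D feature map so that an ordinary 2D diffusion backbone (a hypothetical `unet` below) can denoise them jointly. The exact arrangement RODIN uses may differ; this only illustrates the roll-out/roll-in bookkeeping.

```python
import torch

def roll_out(triplane: torch.Tensor) -> torch.Tensor:
    """Unroll tri-plane features (B, 3, C, H, W) into one 2D plane (B, C, H, 3*W)."""
    B, P, C, H, W = triplane.shape
    assert P == 3, "expects XY, XZ, YZ feature planes"
    # Concatenate the three planes horizontally so a 2D network sees all of them.
    return torch.cat([triplane[:, i] for i in range(P)], dim=-1)

def roll_in(plane: torch.Tensor) -> torch.Tensor:
    """Inverse: split (B, C, H, 3*W) back into tri-plane form (B, 3, C, H, W)."""
    return torch.stack(plane.chunk(3, dim=-1), dim=1)

# Toy round trip: a batch of tri-planes survives roll-out and roll-in unchanged.
x = torch.randn(2, 3, 32, 64, 64)
assert torch.equal(roll_in(roll_out(x)), x)

# In RODIN-style training, the 2D denoiser would operate on the rolled-out
# plane, with 3D-aware attention tying the three sub-planes together:
# denoised = unet(roll_out(x), timestep)
```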
Microsoft papers published at CVPR 2023 with their authors:

1. [3D Human Mesh Estimation from Virtual Markers](https://www.microsoft.com/en-us/research/publication/3d-human-mesh-estimation-from-virtual-markers/)
   Xiaoxuan Ma, *Peking University*; Jiajun Su, *Peking University*; Chunyu Wang, *Microsoft Research*; Wentao Zhu, *Peking University*; Yizhou Wang, *Peking University and National Engineering Research Center of Visual Technology*
2. [3D Line Mapping Revisited](https://www.microsoft.com/en-us/research/publication/limap-global-mapper-3d-line-mapping-revisited/)
   Shaohui Liu, *ETH Zurich*; Yifan Yu, *ETH Zurich*; Rémi Pautrat, *ETH Zurich*; Marc Pollefeys, *ETH Zurich and Microsoft Research*; Viktor Larsson, *Lund University*
3. [BlendFields: Few-Shot Example-Driven Facial Modeling](https://www.microsoft.com/en-us/research/publication/blendfields-few-shot-example-driven-facial-modeling/)
   Kacper Kania, *Warsaw University of Technology*; Stephan J. Garbin, *Microsoft Research*; Andrea Tagliasacchi, *Simon Fraser University and Google Brain*; Virginia Estellers, *Microsoft Research*; Kwang Moo Yi, *University of British Columbia*; Julien Valentin, *Microsoft Research*; Tomasz Trzciński, *Jagiellonian University*; Marek Kowalski, *Microsoft Research*
4. [CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning](https://www.microsoft.com/en-us/research/publication/cico-domain-aware-sign-language-retrieval-via-cross-lingual-contrastive-learning/)
   Yiting Cheng, *Fudan University*; Fangyun Wei, *Microsoft Research*; Jianmin Bao, *Microsoft Research*; Dong Chen, *Microsoft Research*; Wenqiang Zhang, *Fudan University*
5. [Deep Frequency Filtering for Domain Generalization](https://www.microsoft.com/en-us/research/publication/deep-frequency-filtering-for-domain-generalization/)
   Shiqi Lin, *University of Science and Technology of China*; Zhizheng Zhang, *Microsoft Research*; Zhipeng Huang, *University of Science and Technology of China*; Yan Lu, *Microsoft Research*; Cuiling Lan, *Microsoft Research*; Peng Chu, *Microsoft*; Quanzeng You, *Microsoft*; Jiang Wang, *Microsoft*; Zicheng Liu, *Microsoft Research*; Amey Parulkar, *Microsoft*; Viraj Navkal, *Microsoft*; Zhibo Chen, *University of Science and Technology of China*
6. [DeepLSD: Line Segment Detection and Refinement with Deep Image Gradients](https://www.microsoft.com/en-us/research/publication/deeplsd-line-segment-detection-and-refinement-with-deep-image-gradients/)
   Rémi Pautrat, *ETH Zurich*; Daniel Barath, *ETH Zurich*; Viktor Larsson, *Lund University*; Martin R. Oswald, *University of Amsterdam*; Marc Pollefeys, *ETH Zurich and Microsoft Research*
7. [DETRs with Hybrid Matching](https://www.microsoft.com/en-us/research/publication/detrs-with-hybrid-matching/)
   Ding Jia, *Peking University*; Yuhui Yuan, *Microsoft Research*; Haodi He, *Stanford University*; Xiaopei Wu, *Zhejiang University*; Haojun Yu, *Peking University*; Weihong Lin, *Microsoft Research*; Lei Sun, *Microsoft Research*; Chao Zhang, *Peking University*; Han Hu, *Microsoft Research*
8. [EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention](https://www.microsoft.com/en-us/research/publication/efficientvit-memory-efficient-vision-transformer-with-cascaded-group-attention/)
   Xinyu Liu, *Chinese University of Hong Kong*; Houwen Peng, *Microsoft Research*; Ningxin Zheng, *Microsoft Research*; Yuqing Yang, *Microsoft Research*; Han Hu, *Microsoft Research*; Yixuan Yuan, *Chinese University of Hong Kong*
9. [Four-View Geometry with Unknown Radial Distortion](https://www.microsoft.com/en-us/research/publication/four-view-geometry-with-unknown-radial-distortion/)
   Petr Hruby; Viktor Korotynskiy; Timothy Duff; Luke Oeding; Marc Pollefeys, *ETH Zurich and Microsoft Research*; Tomas Pajdla; Viktor Larsson, *Lund University*
10. [High-Fidelity and Freely Controllable Talking Head Video Generation](https://www.microsoft.com/en-us/research/publication/high-fidelity-and-freely-controllable-talking-head-video-generation/)
    Yue Gao, *Microsoft Research*; Yuan Zhou, *Microsoft Research*; Jinglu Wang, *Microsoft Research*; Xiao Li, *Microsoft Research*; Xiang Ming, *Microsoft Research*; Yan Lu, *Microsoft Research*
11. [Human Pose as Compositional Tokens](https://www.microsoft.com/en-us/research/publication/human-pose-as-compositional-tokens/)
    Zigang Geng, *University of Science and Technology of China and Microsoft Research*; Chunyu Wang, *Microsoft Research*; Yixuan Wei, *Tsinghua University and Microsoft Research*; Ze Liu, *University of Science and Technology of China and Microsoft Research*; Houqiang Li, *University of Science and Technology of China*; Han Hu, *Microsoft Research*
12. [iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-training for Visual Recognition](https://www.microsoft.com/en-us/research/publication/iclip-bridging-image-classification-and-contrastive-language-image-pre-training-for-visual-recognition/)
    Yixuan Wei, *Tsinghua University and Microsoft Research*; Yue Cao, *Microsoft Research*; Zheng Zhang, *Microsoft Research*; Houwen Peng, *Microsoft Research*; Zhuliang Yao, *Tsinghua University and Microsoft Research*; Zhenda Xie, *Tsinghua University and Microsoft Research*; Han Hu, *Microsoft Research*; Baining Guo, *Microsoft Research*
13. [Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks](https://www.microsoft.com/en-us/research/publication/image-as-a-foreign-language-beit-pretraining-for-all-vision-and-vision-language-tasks/)
    Wenhui Wang, *Microsoft*; Hangbo Bao, *Microsoft*; Li Dong, *Microsoft Research*; Johan Bjorck, *Microsoft*; Zhiliang Peng, *Microsoft*; Qiang Liu, *Microsoft*; Kriti Aggarwal, *Microsoft Research*; Owais Khan Mohammed, *Microsoft*; Saksham Singhal, *Microsoft Research*; Subhojit Som, *Microsoft*; Furu Wei, *Microsoft Research*
14. [Iterative Proposal Refinement for Weakly-Supervised Video Grounding](https://www.microsoft.com/en-us/research/publication/iterative-proposal-refinement-for-weakly-supervised-video-grounding/)
    Meng Cao, *Peking University*; Fangyun Wei, *Microsoft Research*; Can Xu, *Microsoft Research*; Xiubo Geng, *Microsoft Research*; Long Chen, *Hong Kong University of Science and Technology*; Can Zhang, *Peking University*; Yuexian Zou, *Peking University*; Tao Shen, *Microsoft*; Daxin Jiang, *Microsoft Research*
15. [LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction](https://www.microsoft.com/en-us/research/publication/layoutformer-conditional-graphic-layout-generation-via-constraint-serialization-and-decoding-space-restriction/)
    Zhaoyun Jiang, *Xi'an Jiaotong University*; Jiaqi Guo, *Microsoft Research*; Shizhao Sun, *Microsoft Research*; Huayu Deng, *Shanghai Jiaotong University*; Zhongkai Wu, *Beihang University*; Vuksan Mijovic, *Microsoft*; Zijiang James Yang, *Xi'an Jiaotong University*; Jian-Guang Lou, *Microsoft Research*; Dongmei Zhang, *Microsoft Research*
16. [Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing](https://www.microsoft.com/en-us/research/publication/learning-to-exploit-temporal-structure-for-biomedical-vision-language-processing/)
    Shruthi Bannur, *Microsoft Research*; Stephanie Hyland, *Microsoft Research*; Qianchu Liu; Fernando Pérez García, *Microsoft Research*; Maximilian Ilse, *Microsoft Research*; Daniel C. Castro, *Microsoft Research*; Benedikt Boecking; Harshita Sharma, *Microsoft Research*; Kenza Bouzid, *Microsoft Research*; Anja Thieme, *Microsoft Research*; Anton Schwaighofer, *Microsoft Research*; Maria Wetscherek; Matthew P. Lungren; Aditya Nori, *Microsoft Research*; Javier Alvarez-Valle, *Microsoft Research*; Ozan Oktay, *Microsoft Research*
17. [Look Before You Match: Instance Understanding Matters in Video Object Segmentation](https://www.microsoft.com/en-us/research/publication/look-before-you-match-instance-understanding-matters-in-video-object-segmentation/)
    Junke Wang, *Shanghai Collaborative Innovation Center on Intelligent Visual Computing*; Dongdong Chen, *Microsoft Research*; Zuxuan Wu, *Shanghai Collaborative Innovation Center on Intelligent Visual Computing*; Chong Luo, *Microsoft Research*; Chuanxin Tang, *Microsoft Research*; Xiyang Dai, *Microsoft Research*; Yucheng Zhao, *Microsoft Research*; Yujia Xie, *Microsoft Research*; Lu Yuan, *Microsoft Research*; Yu-Gang Jiang, *Shanghai Collaborative Innovation Center on Intelligent Visual Computing*
18. [MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining](https://www.microsoft.com/en-us/research/publication/maskclip-masked-self-distillation-advances-contrastive-language-image-pretraining/)
    Xiaoyi Dong, *University of Science and Technology of China*; Jianmin Bao, *Microsoft Research*; Yinglin Zheng, *Xiamen University*; Ting Zhang, *Microsoft Research*; Dongdong Chen, *Microsoft Research*; Hao Yang, *Microsoft Research*; Ming Zeng, *Xiamen University*; Weiming Zhang, *University of Science and Technology of China*; Lu Yuan, *Microsoft Research*; Dong Chen, *Microsoft Research*; Fang Wen, *Microsoft Research*; Nenghai Yu, *University of Science and Technology of China*
19. [MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation](https://www.microsoft.com/en-us/research/publication/metaportrait-identity-preserving-talking-head-generation-with-fast-personalized-adaptation/)
    Bowen Zhang, *University of Science and Technology of China*; Chenyang Qi, *Hong Kong University of Science and Technology*; Pan Zhang, *University of Science and Technology of China*; Bo Zhang, *Microsoft Research*; HsiangTao Wu, *Microsoft*; Dong Chen, *Hong Kong University of Science and Technology*; Qifeng Chen, *Hong Kong University of Science and Technology*; Yong Wang, *University of Science and Technology of China*; Fang Wen, *Microsoft*
20. [MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation](https://www.microsoft.com/en-us/research/publication/mm-diffusion-learning-multi-modal-diffusion-models-for-joint-audio-and-video-generation/)
    Ludan Ruan, *Renmin University of China*; Yiyang Ma, *Peking University*; Huan Yang, *Microsoft Research*; Huiguo He, *Microsoft Research*; Bei Liu, *Microsoft Research*; Jianlong Fu, *Microsoft Research*; Nicholas Jing Yuan, *Microsoft Research*; Qin Jin, *Renmin University of China*; Baining Guo, *Microsoft Research*
21. [Motion Information Propagation for Neural Video Compression](https://www.microsoft.com/en-us/research/publication/motion-information-propagation-for-neural-video-compression/)
    Linfeng Qi, *University of Science and Technology of China*; Jiahao Li, *Microsoft Research*; Bin Li, *Microsoft Research*; Houqiang Li, *University of Science and Technology of China*; Yan Lu, *Microsoft Research*
22. [Natural Language-Assisted Sign Language Recognition](https://www.microsoft.com/en-us/research/publication/natural-language-assisted-sign-language-recognition/)
    Ronglai Zuo, *Hong Kong University of Science and Technology*; Fangyun Wei, *Microsoft Research*; Brian Mak, *Hong Kong University of Science and Technology*
23. [Neural Video Compression with Diverse Contexts](https://www.microsoft.com/en-us/research/publication/neural-video-compression-with-diverse-contexts/)
    Jiahao Li, *Microsoft Research*; Bin Li, *Microsoft Research*; Yan Lu, *Microsoft Research*
24. [On Data Scaling in Masked Image Modeling](https://www.microsoft.com/en-us/research/publication/on-data-scaling-in-masked-image-modeling/)
    Zhenda Xie, *Tsinghua University and Microsoft Research*; Zheng Zhang, *Microsoft Research*; Yue Cao, *Microsoft Research*; Yutong Lin, *Xi'an Jiaotong University and Microsoft Research*; Yixuan Wei, *Tsinghua University and Microsoft Research*; Qi Dai, *Microsoft Research*; Han Hu, *Microsoft Research*
25. [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://www.microsoft.com/en-us/research/publication/paint-by-example-exemplar-based-image-editing-with-diffusion-models/)
    Binxin Yang, *University of Science and Technology of China*; Shuyang Gu, *Microsoft Research*; Bo Zhang, *Microsoft Research*; Ting Zhang, *Microsoft Research*; Xuejin Chen, *University of Science and Technology of China*; Xiaoyan Sun, *University of Science and Technology of China*; Dong Chen, *Microsoft Research*; Fang Wen, *Microsoft Research*
26. [ReCo: Region-Controlled Text-to-Image Generation](https://www.microsoft.com/en-us/research/publication/reco-region-controlled-text-to-image-generation/)
    Zhengyuan Yang, *Microsoft Research*; Jianfeng Wang, *Microsoft*; Zhe Gan, *Microsoft*; Linjie Li, *Microsoft Research*; Kevin Lin, *Microsoft Research*; Chenfei Wu, *Microsoft Research*; Nan Duan, *Microsoft*; Zicheng Liu, *Microsoft Research*; Ce Liu, *Microsoft*; Michael Zeng, *Microsoft Research*; Lijuan Wang, *Microsoft Research*
27. [ResFormer: Scaling ViTs with Multi-Resolution Training](https://www.microsoft.com/en-us/research/publication/resformer-scaling-vits-with-multi-resolution-training/)
    Rui Tian, *Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing*; Zuxuan Wu, *Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing*; Qi Dai, *Microsoft Research*; Han Hu, *Microsoft Research*; Yu Qiao, *Shanghai AI Laboratory*; Yu-Gang Jiang, *Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing*
28. [Revealing the Dark Secrets of Masked Image Modeling](https://www.microsoft.com/en-us/research/publication/revealing-the-dark-secrets-of-masked-image-modeling/)
    Zhenda Xie, *Tsinghua University and Microsoft Research*; Zigang Geng, *University of Science and Technology of China and Microsoft Research*; Jingcheng Hu, *Tsinghua University and Microsoft Research*; Zheng Zhang, *Microsoft Research*; Han Hu, *Microsoft Research*; Yue Cao, *Microsoft Research*
29. [RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion](https://www.microsoft.com/en-us/research/publication/rodin-a-generative-model-for-sculpting-3d-digital-avatars-using-diffusion/)
    Tengfei Wang, *Hong Kong University of Science and Technology*; Bo Zhang, *Microsoft Research*; Ting Zhang, *Microsoft Research*; Shuyang Gu, *Microsoft Research*; Jianmin Bao, *Microsoft Research*; Tadas Baltrusaitis, *Microsoft Research*; Jingjing Shen, *Microsoft Research*; Dong Chen, *Microsoft Research*; Fang Wen, *Microsoft Research*; Qifeng Chen, *Hong Kong University of Science and Technology*; Baining Guo, *Microsoft Research*
30. [SeqTrack: Sequence to Sequence Learning for Visual Object Tracking](https://www.microsoft.com/en-us/research/publication/seqtrack-sequence-to-sequence-learning-for-visual-object-tracking/)
    Xin Chen, *Dalian University of Technology*; Houwen Peng, *Microsoft Research*; Dong Wang, *Dalian University of Technology*; Huchuan Lu, *Dalian University of Technology and Peng Cheng Laboratory*; Han Hu, *Microsoft Research*
31. [Side Adapter Network for Open-Vocabulary Semantic Segmentation](https://www.microsoft.com/en-us/research/publication/side-adapter-network-for-open-vocabulary-semantic-segmentation/)
    Mengde Xu, *Huazhong University of Science and Technology and Microsoft Research*; Zheng Zhang, *Huazhong University of Science and Technology and Microsoft Research*; Fangyun Wei, *Microsoft Research*; Han Hu, *Microsoft Research*; Xiang Bai, *Huazhong University of Science and Technology*
32. [Streaming Video Model](https://www.microsoft.com/en-us/research/publication/streaming-video-model/)
    Yucheng Zhao, *University of Science and Technology of China*; Chong Luo, *Microsoft Research*; Chuanxin Tang, *Microsoft Research*; Dongdong Chen, *Microsoft Research*; Noel Codella, *Microsoft Research*; Zheng-Jun Zha, *University of Science and Technology of China*
33. [Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction](https://www.microsoft.com/en-us/research/publication/structural-multiplane-image-bridging-neural-view-synthesis-and-3d-reconstruction/)
    Mingfang Zhang, *University of Tokyo and Microsoft Research*; Jinglu Wang, *Microsoft Research*; Xiao Li, *Microsoft Research*; Yifei Huang, *University of Tokyo*; Yoichi Sato, *University of Tokyo*; Yan Lu, *Microsoft Research*
34. [SVFormer: Semi-supervised Video Transformer for Action Recognition](https://www.microsoft.com/en-us/research/publication/svformer-semi-supervised-video-transformer-for-action-recognition/)
    Zhen Xing, *Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing*; Qi Dai, *Microsoft Research*; Han Hu, *Microsoft Research*; Jingjing Chen, *Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing*; Zuxuan Wu, *Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing*; Yu-Gang Jiang, *Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing*
35. [TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models](https://www.microsoft.com/en-us/research/publication/tinymim-an-empirical-study-of-distilling-mim-pre-trained-models/)
    Sucheng Ren, *Microsoft Research*; Fangyun Wei, *Microsoft Research*; Zheng Zhang, *Microsoft Research*; Han Hu, *Microsoft Research*
36. [Two-shot Video Object Segmentation](https://www.microsoft.com/en-us/research/publication/two-shot-video-object-segmentation/)
    Kun Yan, *Peking University*; Xiao Li, *Microsoft Research*; Fangyun Wei, *Microsoft Research*; Jinglu Wang, *Microsoft Research*; Chenbin Zhang, *Peking University*; Ping Wang, *Peking University*; Yan Lu, *Microsoft Research*
37. [Unifying Layout Generation with a Decoupled Diffusion Model](https://www.microsoft.com/en-us/research/publication/unifying-layout-generation-with-a-decoupled-diffusion-model/)
    Mude Hui, *Xi'an Jiaotong University*; Zhizheng Zhang, *Microsoft Research*; Xiaoyi Zhang, *Microsoft Research*; Wenxuan Xie, *Microsoft Research*; Yuwang Wang, *Tsinghua University*; Yan Lu, *Microsoft Research*
38. [VideoTrack: Learning to Track Objects via Video Transformer](https://www.microsoft.com/en-us/research/publication/videotrack-learning-to-track-objects-via-video-transformer/)
    Fei Xie, *Shanghai Jiao Tong University*; Lei Chu, *Microsoft Research*; Jiahao Li, *Microsoft Research*; Yan Lu, *Microsoft Research*; Chao Ma, *Shanghai Jiao Tong University*
39. [VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction](https://www.microsoft.com/en-us/research/publication/volrecon-volume-rendering-of-signed-ray-distance-functions-for-generalizable-multi-view-reconstruction/)
    Yufan Ren, *EPFL*; Fangjinhua Wang, *ETH Zurich*; Tong Zhang, *EPFL*; Marc Pollefeys, *ETH Zurich and Microsoft Research*; Sabine Süsstrunk, *EPFL*
40. [X-Avatar: Expressive Human Avatars](https://www.microsoft.com/en-us/research/publication/x-avatar-expressive-human-avatars/)
    Kaiyue Shen, *ETH Zurich*; Chen Guo, *ETH Zurich*; Manuel Kaufmann, *ETH Zurich*; Juan Jose Zarate, *ETH Zurich*; Julien Valentin, *Microsoft Research*; Jie Song, *ETH Zurich*; Otmar Hilliges, *ETH Zurich*
41. [Unifying Vision, Text, and Layout for Universal Document Processing](https://www.microsoft.com/en-us/research/publication/unifying-vision-text-and-layout-for-universal-document-processing/)
    Zineng Tang, *University of North Carolina (UNC) Chapel Hill*; Ziyi Yang, *Microsoft Research*; Guoxin Wang, *Microsoft Research*; Yuwei Fang, *Microsoft Research*; Yang Liu, *Microsoft Research*; Chenguang Zhu, *Microsoft Research*; Michael Zeng, *Microsoft Research*; Cha Zhang, *Microsoft Research*; Mohit Bansal, *University of North Carolina (UNC) Chapel Hill*