{"id":212087,"date":"2016-01-14T20:03:50","date_gmt":"2016-01-14T20:03:50","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/project\/video-and-language\/"},"modified":"2021-05-13T02:20:33","modified_gmt":"2021-05-13T09:20:33","slug":"video-and-language","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/video-and-language\/","title":{"rendered":"Vision and Language"},"content":{"rendered":"<p class=\"asset-content\">Automatically describing visual content with natural language is a fundamental challenge of computer vision and multimedia. Sequence learning (e.g., Recurrent Neural Networks), attention mechanisms, memory networks,\u00a0etc.,\u00a0have attracted increasing interest for visual interpretation. In this project, we focus on the following topics in the emerging area of &#8220;vision and language&#8221;:<\/p>\n<ul>\n<li class=\"asset-content\">Image and video captioning, including the MSR-VTT video-to-language grand challenge and datasets (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/ms-multimedia-challenge.com\/\">http:\/\/ms-multimedia-challenge.com\/<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>).<\/li>\n<li class=\"asset-content\">Image and video commenting (conversation)<\/li>\n<li class=\"asset-content\">Visual storytelling (e.g., generation of paragraphs and\u00a0poems)<\/li>\n<li class=\"asset-content\">Visual question-answering (VQA)<\/li>\n<\/ul>\n<p class=\"asset-content\">In our paper published in CVPR 2016,\u00a0we present a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. In our paper accepted to CVPR 2017, we propose incorporating a copying mechanism into image captioning for learning novel objects. 
We also present\u00a0the Long Short-Term Memory with Transferred Semantic Attributes (LSTM-TSA), a novel deep architecture that incorporates the transferred semantic attributes learnt from images and videos into the CNN plus RNN framework, by training them in an end-to-end manner.\u00a0For visual question-answering, in our work accepted to CVPR 2017, we propose a multi-level joint attention network (MLAN) that can simultaneously reduce the semantic gap through semantic attention and support fine-grained spatial inference through visual attention.<\/p>\n<p class=\"asset-content\" style=\"text-align: left\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-233231 alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/01\/video_and_language_sentence-1024x435.png\" alt=\"video_and_language_sentence\" width=\"816\" height=\"332\" \/><br \/>\nFigure. LSTM-E [Pan et al., CVPR&#8217;16]<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-385373 alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/01\/LSTM-TSA-1024x636.jpg\" alt=\"\" width=\"610\" height=\"381\" \/><br \/>\nFigure. LSTM-TSA [Pan et al., CVPR&#8217;17]<\/p>\n<p><strong>Microsoft Research Video to Language Grand Challenge<\/strong> (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/ms-multimedia-challenge.com\/\">http:\/\/ms-multimedia-challenge.com\/<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>)<\/p>\n<p>Video has become ubiquitous on the Internet, on broadcasting channels, and on personal devices. This has encouraged the development of advanced techniques to analyze the semantic content of video for a wide variety of applications. Recognition of videos has been a fundamental challenge of computer vision for decades. 
Previous research has predominantly focused on recognizing videos with a predefined yet very limited set of individual words. In this grand challenge, we go a step further and aim to translate video content into a complete and natural sentence. Given an input video clip, the goal is to automatically generate a complete and natural sentence that describes the video content, ideally encapsulating its most informative dynamics.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Automatically describing visual content with natural language is a fundamental challenge of computer vision and multimedia. Sequence learning (e.g., Recurrent Neural Networks), attention mechanisms, memory networks,\u00a0etc.,\u00a0have attracted increasing interest for visual interpretation. In this project, we focus on the following topics in the emerging area of &#8220;vision and language&#8221;: Image and video captioning, [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13562,13551],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-212087","msr-project","type-msr-project","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-research-area-graphics-and-multimedia","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"1\/14\/2016","related-publications":[431376,820570,820558,820543,749767,749755,631809,631803,631794,511289,442398,437949,437919,215268,426129,418073,394943,394934,394163,306302,301679,271377,271203,238200,238199,238188],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display
_name":"Jianlong Fu","user_id":32260,"people_section":"Group 1","alias":"jianf"},{"type":"user_nicename","display_name":"Bei Liu","user_id":38889,"people_section":"Group 1","alias":"libei"}],"msr_research_lab":[199560],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/212087","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":8,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/212087\/revisions"}],"predecessor-version":[{"id":745939,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/212087\/revisions\/745939"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=212087"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=212087"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=212087"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=212087"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=212087"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}