{"id":657990,"date":"2020-05-15T09:03:10","date_gmt":"2020-05-15T16:03:10","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=657990"},"modified":"2020-05-19T09:10:09","modified_gmt":"2020-05-19T16:10:09","slug":"objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language\/","title":{"rendered":"Objects are the secret key to revealing the world between vision and language"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-658437 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Anime.gif\" alt=\"\" width=\"1400\" height=\"788\" \/><\/p>\n<p>Humans perceive the world through many channels, such as images viewed by the eyes or voices heard by the ears. Though any individual channel might be incomplete or noisy, humans can naturally align and fuse the information collected from multiple channels to grasp the key concepts needed for a better understanding of the world. One of the core aspirations in artificial intelligence is to develop algorithms that endow computers with an ability to effectively learn from multi-modality (or multi-channel) data, similar to sights and sounds attained from vision and language that help humans make sense of the world around us. For example, computers could mimic this ability by searching the most similar images for a text query (or vice versa) and describing the content of an image using natural language.<\/p>\n<p>Recently, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/expanding-scene-and-language-understanding-with-large-scale-pre-training-and-a-unified-architecture\/\">vision-and-language pre-training<\/a> (VLP) has shown great progress toward addressing this problem. The most representative approach is to train large <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Transformer-based<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> models on massive image-text pair data in a self-supervised manner, such as predicting the masked elements based on their context. The cross-modal representations of the pre-training models can be fine-tuned to adapt to various downstream vision-and-language tasks. However, existing VLP methods simply concatenate image region features and text features as input to the model for pre-training and use self-attention to learn image-text semantic alignments in a brute-force yet implicit manner, leaving the model to figure out the cross-modal alignment from scratch.<\/p>\n<p>In this blog post, we introduce <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/oscar-object-semantics-aligned-pre-training-for-vision-language-tasks\/\">Oscar<\/a> (Object-Semantics Aligned Pre-training) to highlight our observation that objects can be naturally used as anchor points to ease the learning of semantic alignments between images and texts. This discovery leads to a novel VLP framework that creates new state-of-the-art performance on six well-established vision-and-language tasks. Please check out our paper on this technology, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/oscar-object-semantics-aligned-pre-training-for-vision-language-tasks\/\">\u201cOscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks,\u201d<\/a> and explore the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/Oscar\">code<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> for more details.<\/p>\n<div>Oscar is one important piece of Microsoft\u2019s new <u><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" tabindex=\"-1\" title=\"https:\/\/aka.ms\/ai-at-scale-research\" href=\"https:\/\/aka.ms\/ai-at-scale-research\" target=\"_blank\" rel=\"noopener noreferrer\">AI at Scale<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/u> initiative to enable next-generation AI capabilities at scale. In this accompanying <u><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" tabindex=\"-1\" title=\"https:\/\/aka.ms\/aa87dvg\" href=\"https:\/\/aka.ms\/AA87dvg\" target=\"_blank\" rel=\"noopener noreferrer\">AI blog post<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/u>, learn more about how Oscar is being integrated with other technologies to create powerful AI that people can use in original and innovative ways.<\/div>\n<h3><\/h3>\n<h3>Object tags as anchor points<\/h3>\n<p>Though the observed data varies among different channels (modalities), we hypothesize that important factors tend to be shared among multiple channels (for example, dogs can be described visually and verbally), capturing channel-invariant (or modality-invariant) factors at the semantic level. In vision-and-language tasks, salient objects in an image can be mostly detected by modern object detectors, and such objects are often mentioned in the paired text. For example, on the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/cocodataset.org\/#home\">MS COCO dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, the percentages that an image and its paired text share at least 1, 2, or 3 objects are 49.7%, 22.2%, and 12.9%, respectively.<\/p>\n<div id=\"attachment_658011\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-658011\" class=\"wp-image-658011 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-1-Oscar-1024x289.png\" alt=\"From left to right: A. Image-text pair. An image of a dog sitting on a couch with a box outlining the dog, labeled \"dog,\" and another box outlining the couch, labeled \"couch. Below this image is a sentence that reads \"A dog is sitting on a chair.\" B. Objects as anchor points. Language, on the left, points to a box labeled \"Pre-trained\" which moves through word embeddings and into object tag segments, including \"Dog\" and \"Couch.\" Image, on the right, points to \"Object Detector.\" From there, an arrow moves through region features and into object tags segments. Two arrows also go directly from \"Object Detector\" to specific object tags for \"Dog\" and \"Couch.\" C. Semantic spaces. Word embeddings are represented by many plotted yellow dots with a triangle in the center for \"couch\" and many red dots with a square in the center for dog. The many dots for each are separate. Above, Region Features show the two plots of dots joined but not intermingled, with the triangle and square together between both sets of plots.\" width=\"1024\" height=\"289\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-1-Oscar-1024x289.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-1-Oscar-300x85.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-1-Oscar-768x217.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-1-Oscar-1536x434.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-1-Oscar.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-658011\" class=\"wp-caption-text\">Figure 1: Illustration showing the process by which Oscar represents an image-text pair into semantic space. (a) An example of input image-text pair. (b) The object tags are used as anchor points to align image regions with word embeddings of pre-trained language models. (c) The word semantic space is more representative than image region features.<\/p><\/div>\n<p>An example image-text pair is shown in Figure 1a. By utilizing a pre-trained <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Object_detection\">object detector<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1506.01497\">Faster R-CNN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, the image can be represented as a set of visual region features, each of which is associated with an object tag. Accordingly, the sentence can be represented as a sequence of word embeddings using pre-trained language models such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1810.04805\">BERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Importantly, in Oscar we construct the representations of the object tags using their corresponding word embeddings from a pre-trained BERT.<\/p>\n<p>As conceptually illustrated in Figure 1b, this explicitly couples images and sentences in a shared space, allowing objects to play the role of anchor points to align the semantics of vision-and-language. The word embedding space of BERT is semantically well structured after massive pure-text pre-training\u2014this would further provide good initialization for the shared space. In this example, <em>dog<\/em> and <em>couch<\/em> are similar in the visual feature space due to the overlap regions, but they are distinctive in the word embedding space, as illustrated in Figure 1c.<\/p>\n<h3>Oscar learning pipeline<\/h3>\n<p>With object tags introduced as a new component, Oscar differs from existing VLP in two ways:<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><strong>Input<\/strong> <strong>representation<\/strong>. As outlined in Figure 2 below, we define each image-text sample as a triplet, consisting of a word sequence, a set of object tags, and a set of image region features.<\/li>\n<li><strong>Pre<\/strong>&#8211;<strong>training<\/strong> <strong>objective<\/strong>. Depending on how the three items in the triplet are grouped, we view the input from two different perspectives: a modality view and a dictionary view. Each allows us to design a novel pre-training objective: 1) a masked token loss for the dictionary view, which measures the model\u2019s capability of recovering the masked element (word or object tag) based on its context; 2) a contrastive loss for the modality view, which measures the model\u2019s capability of distinguishing an original triple and its \u201cpolluted\u201d version (that is, where an original object tag is replaced with a randomly sampled one).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><div id=\"attachment_658014\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-658014\" class=\"wp-image-658014 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-2-Oscar-1024x235.png\" alt=\"From left to right, Orange box. Text in box reads \"[CLS] A dog is [MASK] on a couch [SEP]\" underscored by \"Word Tokens.\" Blue box. Text in box reads \"dog couch [SEP]\" underscored by \"Object Tags.\" Green Box. Images in box are a dog followed by a chair underscored by text \"Region Features.\" \" width=\"1024\" height=\"235\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-2-Oscar-1024x235.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-2-Oscar-300x69.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-2-Oscar-768x176.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-2-Oscar.png 1431w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-658014\" class=\"wp-caption-text\">Figure 2: Illustration of Oscar input. We represent the image-text pair as a triplet [word tokens in orange, object tags in blue, region features in green], where the object tags (in this case, dog or couch) are proposed to align the cross-domain semantics; when removed, Oscar reduces to previous VLP methods. The input triplet can be understood from two perspectives: a modality view and a dictionary view.<\/p><\/div>Our Oscar model is pre-trained on a large-scale image-text dataset composed of 6.5 million pairs. Oscar is fine-tuned and evaluated on a wide range of vision-and-language understanding and generation tasks, including <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/visualqa.org\/\">Visual Question Answering (VQA)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/cs.stanford.edu\/people\/dorarad\/gqa\/index.html\">Graph Question Answering (GQA)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/lil.nlp.cornell.edu\/nlvr\/\">Natural Language Visual Reasoning for Real (NLVR2)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/kuanghuei\/SCAN\">Image-Text Retrieval<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/kuanghuei\/SCAN\">Text-Image Retrieval<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/cocodataset.org\/#captions-2015\">Image Captioning on COCO dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/nocaps.org\/\">Novel Object Captioning (NoCaps)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. The overall setting is illustrated in Figure 3.<\/p>\n<p>&nbsp;<\/p>\n<div id=\"attachment_658017\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-658017\" class=\"wp-image-658017 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-3-Oscar-1024x328.png\" alt=\"A white box. Text in box reads Image-Text: 6.5M. One: Masked Token Loss. Two: Contrastive Loss. Image-Text Representation. Then, A word-tag-region triplet is shown in the white box in parentheses. Word: A dog is sitting on a couch. Tag: Dog, couch. Region: image of dog, image of couch. To the right, two green boxes. Top box: Understanding. Bulleted list: VQA, GQA, NLVR2, Image-Text Retrieval, Text-Retrieval. Bottom box: Generation. Bulleted list: Image captioning, Novel object captioning. \" width=\"1024\" height=\"328\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-3-Oscar-1024x328.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-3-Oscar-300x96.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-3-Oscar-768x246.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Figure-3-Oscar.png 1431w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-658017\" class=\"wp-caption-text\">Figure 3: Oscar takes a triple as input, is pre-trained with two losses (a masked token loss over words and tags and a contrastive loss between tags and others), and it is then fine-tuned for five understanding and two generation tasks.<\/p><\/div>\n<h3>State-of-the-art performance<\/h3>\n<p>To account for parameter efficiency, we compare models of different sizes in Table 1 below. Oscar achieves new state-of-the-art performance on six tasks. Our base model outperforms previous large models on most tasks, often by a significantly large margin. It demonstrates that Oscar is highly parameter efficient, partially because the use of object tags significantly eases the learning of semantic alignments between images and texts. Here, the VLP baseline methods are collected from <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/ChenRocks\/UNITER\">UNITER<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1908.02265\">VilBERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/airsplay\/lxmert\">LXMERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1909.11059\">VLP<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1908.08530\">VL-BERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1908.06066\">Unicoder-VL<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1912.02315\">12-in-1<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Note that Oscar is pre-trained on 6.5 million pairs, which is less than the 9.6 million pairs used for <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/ChenRocks\/UNITER\">UNITER<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and the 9.18 million pairs for <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/airsplay\/lxmert\">LXMERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<div id=\"attachment_659310\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-659310\" class=\"wp-image-659310 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Oscar-table-1-v2-1024x241.jpg\" alt=\"Table shows Oscar achieves higher performance than current state of the art for image retrieval, text retrieval, image captioning, NoCaps, V.Q.A., and N.L.V.R. 2. \" width=\"1024\" height=\"241\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Oscar-table-1-v2-1024x241.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Oscar-table-1-v2-300x71.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Oscar-table-1-v2-768x181.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Oscar-table-1-v2.jpg 1347w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-659310\" class=\"wp-caption-text\">Table 1: Oscar achieves the best performance on six established vision-and-language tasks. SoTA (state of the art) with subscript S, B, and L indicates the best performance achieved by small, base, and large models (sizes are measured by BERT). Blue indicates the best result for a task, and rows with a grey background indicate results produced by Oscar.<\/p><\/div>\n<h3>Improved image-text alignment<\/h3>\n<p>We visualize the learned semantic feature space of image-text pairs of the COCO test set on a 2D map using <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/T-distributed_stochastic_neighbor_embedding\">t-SNE<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. For each image region and word token, we pass it through the model and use its last-layer output as features. Pre-trained models with and without object tags are compared. The results in Figure 4 reveal some interesting findings. The first finding is intra-class: with the aid of object tags, the distance of the same object between two modalities is substantially reduced. For example, the visual and textual representations for <em>person<\/em> in Oscar is much closer than that in the baseline method, represented as red curves in Figure 4. The second finding is inter-class: object classes of related semantics are getting closer (but still distinguishable) after adding tags, while there is some mix in the baseline, such as animals (<em>zebra<\/em>, <em>elephant, sheep<\/em>, and so on) represented as grey curves in Figure 4. This verifies the importance of object tags in alignment learning: it plays the role of anchor points in linking and regularizing the cross-modal feature learning.<\/p>\n<div id=\"attachment_659313\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-659313\" class=\"wp-image-659313 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Oscar-fig-4-v2-1024x360.jpg\" alt=\"On the left, a heatmap of different word classifications. Inanimate objects, like couch, truck, and clock, are outside of a circled area. In one circle is \"person.\" In another circle, other animal categories like bird, dog, sheep, and others. In the right image, a smaller circle of animals. Person is circled by itself. Zebra is also in a circle of its own. The areas circled are more distinct in this image. \" width=\"1024\" height=\"360\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Oscar-fig-4-v2-1024x360.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Oscar-fig-4-v2-300x105.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Oscar-fig-4-v2-768x270.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/Oscar-fig-4-v2.jpg 1326w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-659313\" class=\"wp-caption-text\">Figure 4: 2D visualization using t-SNE. The points from the same object class share the same color. Oscar (left) improves the cross-domain alignment over the baseline without object tags (right). Red and grey curves cover the objects of the same and related semantics, respectively.<\/p><\/div>\n<h3>Looking forward<\/h3>\n<p>Oscar has demonstrated the power of using objects as anchor points in aligning the image and language modalities. Interesting directions of future work include generalizing Oscar to incorporate more modalities, such as speech or multilingual abilities, and using objects as a natural bridge to distill the knowledge from images to improve natural language tasks.<\/p>\n<p><em><strong>Acknowledgments <\/strong><\/em><\/p>\n<p>This research was conducted by <em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xiul\/\">Xiujun Li<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/xiyinmsu.github.io\/\">Xi Yin<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/chunyuan.li\/\">Chunyuan Li<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/penzhan\/\">Pengchuan Zhang<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/xiaowei-hu\">Xiaowei Hu<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/leizhang\/\">Lei Zhang<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/lijuanw\/\">Lijuan Wang<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/houdong-hu-08334227\">Houdong Hu<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/lidong1\/\">Li Dong<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/fuwei\/\">Furu Wei<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/homes.cs.washington.edu\/~yejin\/\">Yejin Choi<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\">Jianfeng Gao<\/a>.<\/em> Additional thanks go to the entire Project Philly team inside Microsoft, who provided us the computing platform for our research. The implementation in our experiments depends on open source GitHub repositories; we acknowledge all the authors who made their code public, which tremendously accelerates our project progress.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Humans perceive the world through many channels, such as images viewed by the eyes or voices heard by the ears. Though any individual channel might be incomplete or noisy, humans can naturally align and fuse the information collected from multiple channels to grasp the key concepts needed for a better understanding of the world. One [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":659415,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Chunyuan Li","user_id":"37971"},{"type":"user_nicename","value":"Lei Zhang","user_id":"32641"},{"type":"user_nicename","value":"Jianfeng Gao","user_id":"32246"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13545],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-657990","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144931,664548,737755],"related-projects":[737098,649749],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"display_name":"Jianfeng Gao","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\" aria-label=\"Visit the profile page for Jianfeng Gao\">Jianfeng Gao<\/a>","is_active":false,"last_first":"Gao, Jianfeng","people_section":0,"alias":"jfgao"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-960x540.png\" class=\"img-object-cover\" alt=\"Oscar object semantics graphic\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1536x865.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-2048x1153.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1280x720.png 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/1400x788_NoLogo_Oscar_Still-01-1920x1080.png 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Chunyuan Li, Lei Zhang, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\" title=\"Go to researcher profile for Jianfeng Gao\" aria-label=\"Go to researcher profile for Jianfeng Gao\" data-bi-type=\"byline author\" data-bi-cN=\"Jianfeng Gao\">Jianfeng Gao<\/a>","formattedDate":"May 15, 2020","formattedExcerpt":"Humans perceive the world through many channels, such as images viewed by the eyes or voices heard by the ears. Though any individual channel might be incomplete or noisy, humans can naturally align and fuse the information collected from multiple channels to grasp the key&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/657990","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=657990"}],"version-history":[{"count":13,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/657990\/revisions"}],"predecessor-version":[{"id":660699,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/657990\/revisions\/660699"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/659415"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=657990"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=657990"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=657990"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=657990"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=657990"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=657990"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=657990"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=657990"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=657990"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=657990"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=657990"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}