{"id":696898,"date":"2020-10-14T08:02:27","date_gmt":"2020-10-14T15:02:27","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=696898"},"modified":"2020-10-14T10:44:49","modified_gmt":"2020-10-14T17:44:49","slug":"novel-object-captioning-surpasses-human-performance-on-benchmarks","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/novel-object-captioning-surpasses-human-performance-on-benchmarks\/","title":{"rendered":"Novel object captioning surpasses human performance on benchmarks"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_Slower.gif\" alt=\"\"\/><\/figure>\n\n\n\n<p>Consider for a moment what it takes to visually identify and describe something to another person. Now imagine that the other person can\u2019t see the object or image, so every detail matters. How do you decide what information is important and what\u2019s not? You\u2019ll need to know exactly what everything is, where it is, what it\u2019s doing in relation to other objects, and note other attributes like color or position of objects in the foreground or background. This exercise shows there\u2019s no question that translating images into words is a complex task\u2014one humans do so often and innately it seems automatic at times\u2014requiring a wide range of knowledge about many unique things.<\/p>\n\n\n\n<p>In order to translate this skill into artificial intelligence (AI), we need to carefully consider and adapt models to the deep relationships between words and objects, the way they interrelate in expected and unexpected ways, and how contexts like environment and pose of an object affect the subtleties of associating and understanding new objects within categories. 
In AI, this means exploring new ways of training models, untethered to traditional annotation-reliant methods that require sentence-image pairs. To this end, researchers from the Microsoft Azure Cognitive Services team and Microsoft Research have created VIVO (Visual Vocabulary Pretraining), an image-captioning milestone that performs pretraining in the absence of caption annotations and results in new state-of-the-art performance on novel object captioning.<\/p>\n\n\n\n<h2 id=\"refining-vision-and-language-pretraining-for-novel-object-captioning\">Refining vision and language pretraining for novel object captioning<\/h2>\n\n\n\n<p>Novel object captioning (NOC) aims to generate image captions capable of describing novel objects that are not present in the caption training data. NOC can add value to a variety of applications, such as human-computer interaction and image-language understanding. However, NOC is a challenging problem: it requires a visual system that can recognize novel objects and a language model that can generate fluent sentences describing them.<\/p>\n\n\n\n<p>Recently, researchers have developed the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/nocaps.org\/\">novel object captioning challenge (nocaps)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> to evaluate NOC. In this challenge, existing computer vision techniques can be leveraged to recognize novel objects. For example, prior studies have proposed generating template sentences that are filled in with the recognized visual concepts from <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Object_detection\">object detectors<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
However, the captioning capability is limited by the object detection vocabulary, and the context of objects can hardly be well described by pre-defined templates.<\/p>\n\n\n\n<p>Vision and language pretraining (VLP) has been shown to be effective for cross-modal representation learning. Prior works have explored training Transformer-based models on large amounts of image-sentence pairs. The learned cross-modal representations can be fine-tuned to improve performance on image captioning, as in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/expanding-scene-and-language-understanding-with-large-scale-pre-training-and-a-unified-architecture\/\">VLP<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language\/\">OSCAR<\/a>. However, these prior works rely on large amounts of image-sentence pairs for pretraining. When it comes to the nocaps challenge, where no additional paired image-sentence training data is allowed, none of the prior VLP techniques are readily applicable. <\/p>\n\n\n\n<p>This blog post introduces VIVO, developed by the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/\">Microsoft Azure Cognitive Services<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> team and Microsoft Research, which performs pretraining in the absence of caption annotations. By breaking the dependency on paired image-sentence training data in VLP, VIVO can leverage large-scale vision datasets with image-tag pairs in pretraining to learn cross-modality alignment, building a rich visual vocabulary at scale. 
Our discovery leads to a new captioning framework that achieves <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/evalai.cloudcv.org\/web\/challenges\/challenge-page\/355\/leaderboard\/1011\">new state-of-the-art performance on the nocaps benchmark<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and surpasses human performance for the first time.<\/p>\n\n\n\n<p>Please check out our paper, titled \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/vivo-surpassing-human-performance-in-novel-object-captioning-with-visual-vocabulary-pre-training\/\">VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training<\/a>,\u201d for more details, and gain further insight into the researchers\u2019 perspectives on how this breakthrough impacts caption generation in Azure AI and accessibility in this <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/AA99bjt\">blog post<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> from The AI Blog. <\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"624\" height=\"361\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/NOCAPS_Figure1.png\" alt=\"Three boxes are shown, one above and two below. Top box: VIVO Pre-training. Images of a man sitting outdoors with a laptop and a dog, a dog, and a man playing an accordion and singing are labeled (image, tags). Next to this, visual vocabulary is represented by two images of a dog with an arrow pointing to zeroes in a cloud. The label \"dog\" has an arrow pointing to a plus sign in the cloud. 
This same configuration is shown for an accordion (instrument), couch, and two different images of a man (person).\n\nLower box, left: Fine-tuning. An image of a woman on a couch with a dog labeled (image, sentence, tags). Text reads \"A person holding a dog sitting on a couch.\"\n\nLower box, right: Inference. An image of a man holding an umbrella in one hand and an accordion in the other. He is wearing a tall hat and sits on a couch. Text reads \"A person holding a black umbrella and an accordion.\"\n\" class=\"wp-image-696910\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/NOCAPS_Figure1.png 624w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/NOCAPS_Figure1-300x174.png 300w\" sizes=\"auto, (max-width: 624px) 100vw, 624px\" \/><figcaption>Figure 1: VIVO pretraining uses paired image-tag data to learn a rich visual vocabulary where image region features and tags of the same object are aligned. Fine-tuning is conducted on paired image-sentence data that only cover a limited number of objects (in blue). During inference, our model can generalize to describe novel objects (in yellow) that are learned during VIVO pretraining.<\/figcaption><\/figure><\/div>\n\n\n\n<p>As shown in Figure 1, we define visual vocabulary as a joint embedding space where the image region features and tags of semantically similar objects are mapped into feature vectors that are close to each other, for example \u201cperson\u201d and \u201cman\u201d or \u201caccordion\u201d and \u201cinstrument.\u201d Once the model is pretrained, it is fine-tuned on image-caption pairs to learn caption generation. Note that the fine-tuning dataset only covers a subset of the most common objects in the learned visual vocabulary. 
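The idea of a joint embedding space can be illustrated with a small numeric sketch. Everything below is hypothetical (toy vectors, a toy dimension of 4, made-up tag embeddings), not the actual model code: the point is only that, in a learned visual vocabulary, an image region's feature vector scores highest under cosine similarity against the tags of semantically similar objects.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two vectors in the shared embedding space.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical tag embeddings in a toy 4-dimensional joint space.
# "person" and "man" are deliberately close; "accordion" is far from both.
tag_embeddings = {
    "person":    np.array([0.9, 0.1, 0.0, 0.1]),
    "man":       np.array([0.8, 0.2, 0.1, 0.1]),
    "accordion": np.array([0.0, 0.9, 0.3, 0.0]),
}

# A hypothetical feature vector for an image region that shows a man.
region_feature = np.array([0.85, 0.15, 0.05, 0.1])

# Rank tags by similarity to the region: aligned tags score highest.
scores = {tag: cosine_similarity(region_feature, emb)
          for tag, emb in tag_embeddings.items()}
best = max(scores, key=scores.get)
```

In the real model the region and tag representations come from the Transformer; the toy vectors here only show that a "man" region lands near "person" and "man" while staying far from "accordion."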
Nevertheless, our model can still generalize to test images that contain a similar scene (like the people sitting on couches in Figure 1) with novel objects unseen in the fine-tuning dataset (like \u201caccordion\u201d), thanks to the visual vocabulary learned in the pretraining stage.<\/p>\n\n\n\n<p>Our VIVO pretraining learns to ground the image regions to the object tags. In fine-tuning, our model learns how to compose natural language captions. Combining the two skills achieves compositional generalization, allowing zero-shot captioning of novel objects.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"579\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figure2_updateres-1024x579.jpg\" alt=\"(a) Pretraining: learn visual vocabulary. An image shows red bounding boxes around objects in an image of a man playing accordion and singing--they ID an audience member, a hat, the man, and an accordion. Text reads \"Open Images 6.4K tags without caption.\" To the right, a training scheme shows labels represented by green dots (animal, person, [mask], hat), and the objects from bounding boxes in the image represented with yellow dots. These move through a multi-layer transformer and result in identifying an accordion. To the far right, a 9 by 10 grid has 36 cells vertical in green, 45 cells in yellow, labeled \"attention mask.\"\n\n(b) Fine-tuning: learn sentence description. Image of a woman holding a dog sitting on a couch. Text reads the same. Image has bounding boxes around those three objects, labeled \"COCO 80 objects with caption.\" To the right, a similar structure as shown in (a), adding orange dots indicating: [CLS] a [mask] holding a [mask] sitting on a couch [SEP]. Labels: person, dog, couch (green). Image objects (yellow). Scheme results in person, dog. 
Far right: attention mask 9 cells of 27 in first column are orange, 27 cells in second column are green. 27 cells in third column are yellow. \n\n(c) Inference: novel object captioning. Image of man with umbrella, accordion, tall hat. Umbrella, man, accordion have bounding boxes. To the right, orange: [CLS] [MASK]. Green: person, umbrella, accordion. Image objects in yellow. Result is \"accordion.\"\n\" class=\"wp-image-697933\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figure2_updateres-1024x579.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figure2_updateres-300x170.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figure2_updateres-768x435.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figure2_updateres-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figure2_updateres.jpg 1391w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 2: The proposed training scheme. (a) In VIVO pretraining, we train a Transformer model on (image, tag) pairs for tag prediction, where it learns cross-modal representations for rich visual concepts. (b) In fine-tuning, we train the same model on limited (image, sentence) pairs to learn how to generate captions that are conditioned on the image and tags. (c) During inference, given the image and detected tags, our model is applied iteratively to generate a sequence of words describing novel objects in an auto-regressive manner.<\/figcaption><\/figure><\/div>\n\n\n\n<p>Our training scheme consists of three main stages as shown in Figure 2. In pretraining, we feed a multi-layer Transformer model an input consisting of the image region features and the paired tag set; a single image can have multiple tags associated with it. 
We then randomly mask one or more tags, and we ask the model to predict these masked tags, conditioned on the image region features and the other tags. Given that tags are not ordered, we develop a Hungarian matching loss for tag prediction.<\/p>\n\n\n\n<p>After pretraining, the Transformer model is fine-tuned on a dataset where both captions and tags are available, like the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/cocodataset.org\/#home\">COCO dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> with 80 object classes and caption annotations. 
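Because the ground-truth tags in the pretraining objective form an unordered set, predictions have to be matched to targets before a loss can be computed. The sketch below is a toy stand-in with hypothetical names and shapes; it brute-forces the matching over permutations, whereas a Hungarian solver would be used at scale.

```python
import numpy as np
from itertools import permutations

def matching_tag_loss(pred_log_probs, target_tag_ids):
    # Toy bipartite-matching loss for unordered tag prediction: try every
    # assignment of prediction slots to target tags and keep the cheapest.
    # (Brute force is exponential; the Hungarian algorithm does this in
    # polynomial time for real tag sets.)
    n = len(target_tag_ids)
    best = float("inf")
    for perm in permutations(range(n)):
        # Assign prediction slot i to the target at position perm[i];
        # cost is the mean negative log-likelihood of the matched tags.
        nll = -sum(pred_log_probs[i, target_tag_ids[perm[i]]]
                   for i in range(n)) / n
        best = min(best, nll)
    return float(best)

# Two masked slots over a 4-tag vocabulary; slot 0 favors tag 0, slot 1 tag 2.
log_probs = np.log(np.array([[0.7, 0.1, 0.1, 0.1],
                             [0.1, 0.1, 0.7, 0.1]]))

# The loss is the same whichever order the ground-truth tags arrive in.
loss_unordered = matching_tag_loss(log_probs, [2, 0])
loss_ordered = matching_tag_loss(log_probs, [0, 2])
```

The key property this illustrates is order invariance: a plain position-wise cross-entropy would penalize the model for predicting correct tags in the "wrong" slots, while a matching loss does not.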
The tags can also come from the predictions of a tagging or detection model.<\/p>\n\n\n\n<p>In the inference stage, given the input image and the detected tags, our model generates a sequence of word tokens in an auto-regressive manner to form the final output caption.<\/p>\n\n\n\n<h2 id=\"state-of-the-art-performance-and-exceeding-human-cider-scores\">State-of-the-art performance and exceeding human CIDEr scores<\/h2>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"522\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_updated-res-Table-1024x522.jpg\" alt=\"\" class=\"wp-image-697930\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_updated-res-Table-1024x522.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_updated-res-Table-300x153.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_updated-res-Table-768x391.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_updated-res-Table.jpg 1513w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n\n\n\n<p>We compare our method with <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1812.08658\">UpDown<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/oscar-object-semantics-aligned-pre-training-for-vision-language-tasks\/\">OSCAR<\/a>, which represent the state of the art on the nocaps benchmark. The training data for the baselines is the COCO dataset. 
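The auto-regressive inference stage described above can be sketched as a simple decoding loop. The `predict_next` interface and the `EchoModel` stand-in below are entirely hypothetical, purely to show the control flow of generating one token at a time until an end token appears.

```python
def generate_caption(model, region_features, detected_tags, max_len=20,
                     bos="[CLS]", eos="[SEP]"):
    # Auto-regressive decoding: repeatedly ask the model for the next token,
    # conditioned on the image regions, detected tags, and the partial caption.
    tokens = [bos]
    for _ in range(max_len):
        next_token = model.predict_next(region_features, detected_tags, tokens)
        if next_token == eos:
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])  # drop the start token

class EchoModel:
    # Minimal stand-in model that replays a fixed caption, for illustration
    # only; a real model would score the vocabulary at each step.
    def __init__(self, caption_tokens):
        self._script = list(caption_tokens) + ["[SEP]"]

    def predict_next(self, regions, tags, partial):
        return self._script[len(partial) - 1]

model = EchoModel("a person holding an accordion".split())
caption = generate_caption(model, region_features=None,
                           detected_tags=["person", "accordion"])
# caption == "a person holding an accordion"
```

In practice the loop would run a Transformer forward pass per step (typically with beam search rather than greedy decoding), but the termination and token-accumulation logic is the same.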
Following prior settings, we also report results after applying <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openaccess.thecvf.com\/content_cvpr_2017\/papers\/Rennie_Self-Critical_Sequence_Training_CVPR_2017_paper.pdf\">SCST<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.aclweb.org\/anthology\/D17-1098.pdf\">Constrained Beam Search (CBS)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p>The evaluation results on nocaps validation and test sets are shown in Table 1. Our method achieves significant improvements over prior works. Our plain version (VIVO) already outperforms UpDown+ELMo+CBS and OSCAR by a large margin. Our approach establishes new <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/evalai.cloudcv.org\/web\/challenges\/challenge-page\/355\/leaderboard\/1011\">state-of-the-art results and surpasses human CIDEr scores<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on the overall dataset.<\/p>\n\n\n\n<h2 id=\"visual-text-alignment-precise-localization-of-novel-objects\">Visual-text alignment: precise localization of novel objects<\/h2>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"652\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figre-3_updatedres-1024x652.jpg\" alt=\"\" class=\"wp-image-697936\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figre-3_updatedres-1024x652.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figre-3_updatedres-300x191.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figre-3_updatedres-768x489.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/Nocaps_figre-3_updatedres.jpg 1194w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 3: Image captioning results on nocaps. B: our baseline without VIVO pretraining. V: our approach with VIVO pretraining. Red text represents novel objects. For each image, we show the similarity scores of each image region to the novel objects that appear in the captions. The bounding box color is brighter when the similarity is higher.<\/figcaption><\/figure><\/div>\n\n\n\n<p>To further understand the effect of VIVO pretraining on learning a visual vocabulary, that is, aligning image regions with object tags, we show how novel object tags can be grounded to image regions. We estimate the similarity between the representations of each image region and object tag pair. We highlight the pairs with high scores in Figure 3. The results show that our model can precisely localize these novel objects.<\/p>\n\n\n\n<h2 id=\"looking-forward-high-potential-for-performance-improvements\">Looking forward: High potential for performance improvements<\/h2>\n\n\n\n<p>We have demonstrated the power of learning visual vocabulary for novel object captioning. As the first VLP method that does not rely on paired image-sentence data, VIVO can leverage a large-scale vision dataset with image-tag pairs in pretraining. It is worth noting that using machine-generated image tags rather than human-written captions makes it possible to utilize potentially unlimited training images for improving performance, which we will pursue in future work.<\/p>\n\n\n\n<p>We will have more updates in the coming months. 
Please check out our <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/azure-florence-vision-and-language\/\">project page<\/a> to learn more about our technology and future updates.<\/p>\n\n\n\n<p>This research was conducted by <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xiaowh\/\">Xiaowei Hu<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/keli\/\">Kevin Lin<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/lijuanw\/\">Lijuan Wang<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/leizhang\/\">Lei Zhang<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\">Jianfeng Gao<\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/zliu\/\">Zicheng Liu<\/a> from the Microsoft Azure Cognitive Services team in collaboration with Microsoft Research. This research is part of our <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/azure-florence-vision-and-language\/\">Azure Florence research initiative on vision and language<\/a>, sponsored by Microsoft Azure Cognitive Services.<\/p>\n\n\n\n<p><br><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Consider for a moment what it takes to visually identify and describe something to another person. Now imagine that the other person can\u2019t see the object or image, so every detail matters. How do you decide what information is important and what\u2019s not? 
You\u2019ll need to know exactly what everything is, where it is, what [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":697999,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Kevin Lin","user_id":"39694"},{"type":"user_nicename","value":"Xiaowei Hu","user_id":"39697"},{"type":"user_nicename","value":"Lijuan Wang","user_id":"32680"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-696898","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144931,737755],"related-projects":[689814,279642],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Kevin Lin","user_id":39694,"display_name":"Kevin Lin","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/keli\/\" aria-label=\"Visit the profile page for Kevin Lin\">Kevin Lin<\/a>","is_active":false,"last_first":"Lin, Kevin","people_section":0,"alias":"keli"},{"type":"user_nicename","value":"Lijuan Wang","user_id":32680,"display_name":"Lijuan Wang","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/lijuanw\/\" aria-label=\"Visit the profile page for 
Lijuan Wang\">Lijuan Wang<\/a>","is_active":false,"last_first":"Wang, Lijuan","people_section":0,"alias":"lijuanw"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-960x540.jpg\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-1536x864.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-2048x1152.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/10\/1400x788_NoCaps_NoLogo_Still-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/keli\/\" title=\"Go to researcher profile for Kevin Lin\" aria-label=\"Go to researcher profile for Kevin Lin\" data-bi-type=\"byline author\" data-bi-cN=\"Kevin Lin\">Kevin Lin<\/a>, Xiaowei Hu, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/lijuanw\/\" title=\"Go to researcher profile for Lijuan Wang\" aria-label=\"Go to researcher profile for Lijuan Wang\" data-bi-type=\"byline author\" data-bi-cN=\"Lijuan Wang\">Lijuan Wang<\/a>","formattedDate":"October 14, 2020","formattedExcerpt":"Consider for a moment what it takes to visually identify and describe something to another person. Now imagine that the other person can\u2019t see the object or image, so every detail matters. How do you decide what information is important and what\u2019s not? You\u2019ll need&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/696898","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=696898"}],"version-history":[{"count":10,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/696898\/revisions"}],"predecessor-version":[{"id":698038,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/696898\/revisions\/698038"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/697999"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/m
edia?parent=696898"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=696898"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=696898"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=696898"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=696898"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=696898"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=696898"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=696898"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=696898"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=696898"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=696898"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}