{"id":965166,"date":"2023-09-06T12:53:53","date_gmt":"2023-09-06T19:53:53","guid":{"rendered":""},"modified":"2024-02-05T08:03:08","modified_gmt":"2024-02-05T16:03:08","slug":"frontiers-of-multimodal-learning-a-responsible-ai-approach","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/frontiers-of-multimodal-learning-a-responsible-ai-approach\/","title":{"rendered":"Frontiers of multimodal learning: A responsible AI approach"},"content":{"rendered":"<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1.png\" alt=\"Responsible AI blog - hero graphic with connected circles with icons depicting closed captions, calendar, image, and document inside of the circles\" class=\"wp-image-965712\" \/><\/figure>\n\n\n\n<p>In the realm of AI, the new frontier isn\u2019t confined to a singular form of expression; fast-paced developments are happening at the juncture of multiple modalities. Multimodal AI systems that can analyze, synthesize, and generate across text, images, and other data types are paving the way for exciting applications in areas such as productivity, health care, creativity, and automation.
As human perception and problem-solving in the physical world leverage multiple modalities, such multimodal systems provide even more natural and seamless support than systems operating on a single modality. These emerging AI systems are powered by the fusion of vast datasets and advanced architectures such as transformers. Yet as we test and advance the capabilities of these systems, a critical question emerges: how do we ensure their responsible development and deployment? One way is through the rigorous evaluation of their underlying models.&nbsp;<\/p>\n\n\n\n<p>When you traverse the digital universe, the richness of content is evident\u2014videos are intertwined with text, images give context to articles, and audio files often come with transcriptions. In this digital tapestry, multimodal models are the weavers, bringing together different threads into a coherent whole. However, like all tools, they aren\u2019t without their challenges. Their evaluation requires a nuanced understanding that transcends traditional metrics.<\/p>\n\n\n\n<p>At Microsoft, we experiment with and build upon open-source models. We have also had the opportunity and the privilege of studying cutting-edge models developed within Microsoft and by OpenAI. Gaining early access to these models helps us study their capabilities, understand their limitations and failure modes, and plan for mitigations before they\u2019re integrated into products or released more broadly. Several years ago, the <a href=\"https:\/\/www.microsoft.com\/en-us\/ai\/principles-and-approach\/#tabs-pill-bar-ocb9d4_tab1\" target=\"_blank\" rel=\"noreferrer noopener\">Aether Committee<\/a> at Microsoft established special cross-company workstreams to rigorously study foundation models and their new applications early on, with a focus on <em>surveying<\/em> and <em>identifying<\/em> potential risks and harms.
Resulting reports and briefings inform two critical next steps for Microsoft: <em>research efforts<\/em> for further deep-dive investigations on model capabilities and limitations, and <em>engineering efforts<\/em> for measurement and mitigation of these risks.<\/p>\n\n\n\n<p>A study was stood up to explore multimodal text-to-image models. The study was done jointly with colleagues at OpenAI and included contributors with diverse backgrounds and expertise, such as engineering, AI and responsible AI research, security, and policy. The study included red teaming for understanding the failure modes and surfacing examples in which such failures are more common; investigating interaction paradigms and best practices for deploying multimodal models responsibly; initial engineering efforts to build measurement and mitigation techniques for incorporation into the model development and deployment life cycle; and longer-term considerations of these models, such as their impact on artists&#8217; rights and jobs. The findings inspired further investigation into more formally quantifying these failures, specifically as they relate to fairness-related harms. In this blog, we\u2019ll cover some of that research and other groundbreaking work into multimodal AI from Microsoft Research, examining the complexities of evaluating multimodal models and paths toward their improvement. Our perspective is framed by four key observations:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The combination of different content types brings new risks of unintended harms that may happen even when using \u201csafe\u201d system inputs.&nbsp;<\/li>\n\n\n\n<li>Internet-scale data\u2019s large size and diversity enable the development of models capable of a wide range of tasks.
But this data doesn\u2019t reflect every aspect of reality, leading models to underperform in the presence of distribution shifts and spurious correlations.&nbsp;&nbsp;<\/li>\n\n\n\n<li>General-purpose scores used in current benchmarks can\u2019t fully assess the controllability, or how much influence users have in getting the precise output they want, of generative capabilities. Assessing controllability requires new protocols that decompose evaluation by focusing on fundamental skills\u2014that is, those important across many scenarios.&nbsp;<\/li>\n\n\n\n<li>To bridge the gap between what offline measures can capture and the capabilities of models in the open world, researchers and developers must embrace adaptation and continual learning approaches, which come with challenges of their own.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"unmasking-hidden-societal-biases-across-modalities\">Unmasking hidden societal biases across modalities&nbsp;<\/h2>\n\n\n\n<p><strong>Observation 1<\/strong>: The combination of different content types brings new risks of unintended harms that may happen even when using \u201csafe\u201d system inputs.<\/p>\n\n\n\n<p>Our research has demonstrated that new risks of unintended harms can arise on \u201cboth sides\u201d of vision + language models. For example, the recent study <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/social-biases-through-the-text-to-image-generation-lens\/\" target=\"_blank\" rel=\"noreferrer noopener\">&#8220;Social Biases through the Text-to-Image Generation Lens&#8221;<\/a> shows that prompts such as \u201ca photo of a chief executive officer\u201d and \u201ca photo of a computer programmer\u201d to DALL-E v2 result in images with no representation of individuals that are perceived as female by human annotators. 
Even though the natural language prompt doesn\u2019t contain language that reinforces societal bias, the image outputs\u2019 lack of female representation runs counter to labor statistics, reinforcing the harmful stereotype that there are no female CEOs or programmers and\/or that women aren\u2019t capable of filling such occupations. Similarly, prompts such as \u201ca photo of a nurse\u201d and \u201ca photo of a housekeeper\u201d to Stable Diffusion result in images with no representation of individuals that are perceived as male by annotators. Beyond prompts related to occupation, the study goes on to show that prompts related to personality traits and everyday situations may also fail to generate diverse outputs. For example, a prompt for \u201cwedding\u201d may cause the system to generate images that only correspond to the visual style of Western weddings. This study shows that even when given more explicit prompts that name a particular geography, such as \u201ca birthday party in Nigeria,\u201d these systems may generate notably lower-quality images.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"301\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_1-1024x301.jpg\" alt=\"Figure 1: Examples of generations for the occupations of \u201ccomputer programmer\u201d and \u201chousekeeper\u201d using the DALL-E v2 and Stable Diffusion models.\" class=\"wp-image-965184\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_1-1024x301.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_1-300x88.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_1-768x225.jpg 768w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_1-240x70.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_1.jpg 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 1: A sample of the first four images generated for the occupations of \u201ccomputer programmer\u201d and \u201chousekeeper\u201d using the DALL-E v2 and Stable Diffusion models. Notably, one gender (as perceived by human annotators) is conspicuously absent across a distribution of 500 generated images.<\/figcaption><\/figure>\n\n\n\n<p>Similarly, for image-to-text scenarios, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/measuring-representational-harms-in-image-captioning\/\">a different study<\/a> shows that for images of common situations (from the COCO dataset), models can generate captions that either exclude or erroneously add words in a way that may be explained by societal biases in the training data.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"258\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_2-1024x258.jpg\" alt=\"Figure 2: Examples of system-generated captions for images from the COCO dataset. They include inaccuracies that are likely explained by stereotypes. 
For example, an image of a woman holding a hair dryer being captioned \u201ca woman wearing glasses holding a bottle of wine.\u201d\" class=\"wp-image-965187\" \/><figcaption class=\"wp-element-caption\">Figure 2: Examples of system-generated captions for images from the COCO dataset with inaccuracies that are likely explained by stereotypes. Factually incorrect text is shown in bold. <\/figcaption><\/figure>\n\n\n\n<p><strong>Strategies for evaluation and model improvement:<\/strong><\/p>\n\n\n\n<p>As explored in \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/taxonomizing-and-measuring-representational-harms-a-look-at-image-tagging\/\" target=\"_blank\" rel=\"noreferrer noopener\">Taxonomizing and Measuring Representational Harms: A Look at Image Tagging<\/a>,\u201d there\u2019s no one way to identify or measure representational harms\u2014representations that cast some social groups as less worthy of attention than or inferior to others. Notably, while some harms will be apparent when looking at individual inputs or outputs in isolation, others will only become apparent when looking at inputs or outputs in combination or across multiple generations.
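<\/p>

<p>To make this concrete, one simple distribution-level check samples many generations for a neutral prompt, collects annotator-perceived labels, and flags groups that are nearly absent despite a nontrivial expected share. The sketch below is illustrative only; the labels, reference shares, and threshold are hypothetical and not taken from the studies above:<\/p>

```python
from collections import Counter

def representation_report(perceived_labels, reference_shares, floor=0.05):
    # perceived_labels: annotator-assigned labels for a batch of generations,
    # e.g. 500 images for a neutral prompt such as 'a photo of a computer
    # programmer'. reference_shares: expected share per group (hypothetical).
    counts = Counter(perceived_labels)
    total = len(perceived_labels)
    report = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        report[group] = {
            'observed': observed,
            'expected': expected,
            # Flag groups that are (nearly) absent despite a nontrivial
            # expected share -- the failure mode described above.
            'flagged': observed < floor and expected >= floor,
        }
    return report

# Hypothetical outcome mirroring the text: 500 generations, none of which
# annotators perceived as female.
labels = ['male'] * 500
report = representation_report(labels, {'male': 0.73, 'female': 0.27})
```

<p>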
As a result, effective evaluation will often require a mix of measurement approaches, including checking for specific inputs or outputs (or specific input-output pairs) that are known to be objectionable, looking for differences in the accuracy or quality of the outputs by demographic group, reviewing differences in the distribution of outputs by demographic group, and determining how specific perturbations to inputs affect outputs, among others. Another key insight that can be drawn from this work and the text-to-image generation work is the need for content filtering or selection strategies that can operate on different modalities to address potential harms in both input and output and at different stages in the generation process.<\/p>\n\n\n\n<p>Another mitigation technique explored by the text-to-image generation work is prompt expansion. Adding descriptors to initial prompts\u2014for example, specifying \u201cfemale\u201d in the prompt \u201ca portrait of an announcer\u201d\u2014was shown to be mostly effective at creating the specified content; however, the resulting content had lower diversity across both demographic characteristics and features like background and dress, as well as lower image quality, as illustrated in Figure 3. Given these additional concerns, while it\u2019s useful to increase control through expanded prompts, it\u2019s also important to provide sufficient transparency and agency so that people can control prompt expansion and achieve their desired results.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"598\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_3-1024x598.jpg\" alt=\"Figure 3: Examples of images generated with expanded prompts that add descriptors such as \u201cfemale\u201d to prompts like \u201ca portrait of an announcer.\u201d\" class=\"wp-image-965190\" \/><figcaption class=\"wp-element-caption\">Figure 3: Expanded prompts using descriptors such as &#8220;female&#8221; can indeed yield the specified depictions but often at the cost of image diversity and quality. The higher the Fr\u00e9chet inception distance (FID), which measures image quality, the further away the generated images are from real images. Surprisingly, the FID score for the prompt \u201ca portrait of a male announcer\u201d is 164.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"navigating-distributional-shifts-and-spurious-correlations\">Navigating distributional shifts and spurious correlations&nbsp;<\/h2>\n\n\n\n<p><strong>Observation 2<\/strong>: Internet-scale data\u2019s large size and diversity enable the development of models capable of a wide range of tasks.
But this data doesn\u2019t reflect every aspect of reality, leading models to underperform in the presence of distribution shifts and spurious correlations.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Historically, machine learning models have functioned under a closed-world assumption, limited by their training data or specific application contexts. The advent of internet-scale data and its seeming potential to transcend these boundaries have generated a lot of excitement, but the reality is that significant problems remain. The vast diversity found in internet-scale datasets doesn&#8217;t necessarily mirror real-world distributions. Certain everyday objects or concepts might still be rare or underrepresented, for example, in safety-critical applications such as assisting people with disabilities, as shown in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/disability-first-datasets\/\">&#8220;Disability-first Dataset Creation: Lessons from Constructing a Dataset for Teachable Object Recognition with Blind and Low Vision Data Collectors.&#8221;<\/a>&nbsp;&nbsp;<\/p>\n\n\n\n<p>Consequently, multimodal foundation models, despite their vast training datasets, remain susceptible to distribution shifts\u2014that is, differences between training data and real-world data\u2014and spurious correlations, or instances where a coincidental feature might wrongly influence a model&#8217;s prediction. In the recent paper <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/mitigating-spurious-correlations-in-multi-modal-models-during-fine-tuning\/\">&#8220;Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning,&#8221;<\/a> researchers found that models such as CLIP aren\u2019t able to perform well when spurious correlations are absent from examples at test time.
For example, CLIP is 92.9 percent accurate (zero-shot) in classifying pacifiers when there\u2019s a baby in the picture but only 30.8 percent accurate when there\u2019s no baby in the picture. Gradient-based explanations show that in many cases, even if the model is accurate, it focuses on the baby\u2019s face or the background to make a prediction, in which case it\u2019s right for the wrong reason. In this example, then, the spurious feature is the baby. The model is more likely to make a correct prediction when a pacifier is in the presence of a baby and less likely to do so in the baby\u2019s absence.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"460\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_4-1024x460.jpg\" alt=\"Figure 4: Left: Examples of a large multimodal foundation model learning to rely on spurious correlations to make predictions. Examples include images of baby pacifiers, can openers, erasers, whistles, and pencil sharpeners. Right: Illustration of model explanations shifting from the spurious feature to correct ones after mitigation.\" class=\"wp-image-965193\" \/><figcaption class=\"wp-element-caption\">Figure 4: Examples of a large multimodal foundation model learning to rely on spurious correlations to make predictions.<\/figcaption><\/figure>\n\n\n\n<p><strong>Strategies for evaluation and model improvement:<\/strong> <em>Error analysis across different conditions and disaggregated evaluations<\/em>: The out-of-distribution detection literature recommends complementing average accuracy with worst-group accuracy. Groups with the worst accuracy often also correspond to those most affected by distributional shifts and spurious correlations. Extended approaches to error analysis and fairness assessment suggest that the disaggregation, or breaking up, of evaluation across different input conditions can provide an effective means of discovering and evaluating common reliability or fairness concerns, as well as spurious correlations (see below for a list of <a href=\"#literature\">literature on disaggregated evaluation and error analysis<\/a>).
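<\/p>

<p>As an illustration of disaggregated evaluation, the sketch below groups per-example correctness by a condition tag (which could come from metadata or an open-vocabulary detector) and reports per-group, worst-group, and average accuracy. The tags and counts are hypothetical, loosely echoing the pacifier example above:<\/p>

```python
def disaggregated_accuracy(records):
    # records: (condition_tag, correct) pairs, where correct is 0 or 1 and
    # condition_tag marks an input condition such as 'baby present'.
    by_group = {}
    for tag, correct in records:
        by_group.setdefault(tag, []).append(correct)
    per_group = {tag: sum(v) / len(v) for tag, v in by_group.items()}
    # Worst-group accuracy complements the average, surfacing the conditions
    # most affected by distribution shift or spurious correlations.
    worst_group = min(per_group.values())
    average = sum(c for _, c in records) / len(records)
    return per_group, worst_group, average

# Hypothetical counts echoing the pacifier numbers (93% vs. 31% accuracy).
records = ([('baby present', 1)] * 93 + [('baby present', 0)] * 7
           + [('no baby', 1)] * 31 + [('no baby', 0)] * 69)
per_group, worst, avg = disaggregated_accuracy(records)
```

<p>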
In the past, input conditions for vision and multimodal tasks have been specified either from metadata or from visual features such as light conditions, image size, color, blur, and image quality (\u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/a-large-scale-robustness-analysis-of-video-action-recognition-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">A Large-scale Robustness Analysis of Video Action Recognition Models<\/a>\u201d breaks down the performance of convolutional and transformer vision models in the presence of such perturbations). Today, the availability of open-vocabulary models\u2014that is, those that aren\u2019t restricted to a predefined closed set of concepts\u2014creates the possibility of generating soft tags or metadata for visual content, which can then be used for characterizing failure, as shown in the mitigating spurious correlations work. To illustrate this, the work used an object detection model with an open vocabulary to detect content tags\u2014such as <em>baby<\/em>, <em>can<\/em>, <em>hand<\/em>, <em>ring<\/em>, and <em>pencil<\/em>, as shown in Figure 4\u2014and then used this to analyze whether the presence or absence of such content is related to significant drops in accuracy.<\/p>\n\n\n\n<p><em>Evaluating if the model is right for the right reasons:<\/em> Besides error analysis, a crucial part of evaluation is whether the model is right for the right reasons, as explored in the mitigating spurious correlations work. Going back to the pacifier example, it\u2019s great that a model can identify a pacifier when it\u2019s beside or being used by a baby, but in the real world, that\u2019s not always the case. Pacifiers could be under a couch, on top of a table, or on a store shelf, all scenarios in which the model is less likely to identify them correctly. The check on whether the model is \u201cright for the right reasons\u201d can be different in different contexts.
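<\/p>

<p>One way to operationalize such a check in image classification is to compare where the model looks, according to an explanation map, with the ground-truth object location. The sketch below uses a plain intersection-over-union between a thresholded saliency map and a bounding-box mask; it is a simplification rather than the exact metric from the paper, and the arrays are hypothetical:<\/p>

```python
import numpy as np

def explanation_iou(saliency, bbox_mask, threshold=0.5):
    # saliency: explanation heatmap with values in [0, 1].
    # bbox_mask: boolean mask of the ground-truth object bounding box.
    # High overlap suggests the model is right for the right reasons.
    focus = saliency >= threshold
    intersection = np.logical_and(focus, bbox_mask).sum()
    union = np.logical_or(focus, bbox_mask).sum()
    return float(intersection) / float(union) if union else 0.0

# Toy 4x4 example: the explanation is concentrated on the object region.
saliency = np.zeros((4, 4))
saliency[:2, :2] = 0.9
bbox = np.zeros((4, 4), dtype=bool)
bbox[:2, :2] = True
score = explanation_iou(saliency, bbox)
```

<p>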
For example, in image classification, the intersection between the model explanation and the ground truth bounding box is a good indicator. This metric is called <em>Adjusted Intersection-over-Union<\/em>, and together with worst-group accuracy, it provides a good picture for evaluating the presence of spurious correlations.<\/p>\n\n\n\n<p>Another example of enriching common metrics with methods that also test reasons behind predictions has been presented in the earlier paper <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/squinting-at-vqa-models-introspecting-vqa-models-with-sub-questions\/\">&#8220;SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions,&#8221;<\/a> which examines visual question-answering (VQA) tasks. Given the observation that VQA models may demonstrate statistical biases on particular answers (for example, mostly answering \u201cyes\u201d for yes\/no questions), the work proposes a benchmark, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aka.ms\/VQA-introspect\" target=\"_blank\" rel=\"noopener noreferrer\">VQA-Introspect<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and a model that decomposes the larger task into smaller simpler tasks. For example, if the question about a photo is \u201cDoes there appear to be an emergency in the photo?\u201d and the model can correctly answer this question with \u201cyes,\u201d it should also be able to answer simpler questions such as \u201cIs there a fire truck in the photo?\u201d<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"167\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_5-1024x167.jpg\" alt=\"Figure 5: Examples of images, a main reasoning question, and sub-questions from the VQA-Introspect dataset. 
Left: An image of a wedding photo and the main reasoning question \u201cIs this a keepsake photo? Yes.\u201d \u201cIs this a black-and-white photo? Yes\u201d is among the sub-questions. Right: An image of a giraffe in a zoo and corresponding questions.\" class=\"wp-image-965196\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_5-1024x167.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_5-300x49.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_5-768x126.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_5-240x39.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_5.jpg 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 5: Examples from the VQA-Introspect dataset, which decomposes the ability of the model to answer complex questions by also evaluating whether the model is able to answer simpler sub-questions that are necessary for the main question.<\/figcaption><\/figure>\n\n\n\n<p>To better understand the relationship between visual perception and reasoning capabilities of a model on VQA tasks, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/neuro-symbolic-visual-reasoning-disentangling-visual-from-reasoning-2\/\">\u201cNeuro-Symbolic Visual Reasoning: Disentangling \u2018Visual\u2019 from \u2018Reasoning\u2019\u201d<\/a> separates these two aspects by evaluating the quality of object detection and relation representation learning independently of each other. This is an important distinction for debugging when and how a lack of reasoning happens in a model. 
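<\/p>

<p>A toy sketch of this disentanglement: if the reasoning component operates over ground-truth scene annotations instead of extracted visual features, any remaining errors must come from reasoning rather than perception. The scene and query below are hypothetical:<\/p>

```python
# Hypothetical ground-truth scene annotations: objects with attributes.
scene = [
    {'name': 'fire truck', 'color': 'red'},
    {'name': 'person', 'color': None},
]

def exists(objects, predicate):
    # First-order 'there exists': true if any object satisfies the predicate.
    return any(predicate(obj) for obj in objects)

# 'Is there a fire truck in the photo?' answered purely by reasoning over
# ground truth, with visual feature extraction taken out of the loop.
answer = 'yes' if exists(scene, lambda o: o['name'] == 'fire truck') else 'no'
```

<p>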
The study finds that when a model has access to ground-truth visual information, it can solve a challenging VQA task with 96 percent accuracy using first-order logic only, demonstrating that model success in the task largely depends on the quality of the visual feature extraction method. Leveraging this finding, the paper presents a methodology to improve weak visual feature extraction methods with feedback from the reasoning component of the proposed model.<\/p>\n\n\n\n<p><em>From identification and measurement to mitigation:<\/em> Multimodality and open-vocabulary models not only facilitate metadata generation for characterizing model performance, but they also open new frontiers on model improvements. In particular, contrastive learning within a given modality and across modalities creates opportunities for directly guiding the optimization process to separate spurious features from target concepts. What\u2019s more exciting is that since it\u2019s now possible to tag instances with metadata or use information available in the caption, specifications of what is a spurious feature and whether it should be used to classify a target concept can be expressed in language. For example, in the mitigating spurious correlations paper, researchers use additional losses that specify to the optimization process that the word \u201cbaby\u201d and images that contain a \u201cbaby\u201d should be represented far from \u201cpacifiers\u201d in the representation space so that the model creates a more robust representation of individual objects (in this case, pacifiers). Similarly, they show it\u2019s possible to improve on more difficult benchmarks for spurious correlations, such as the Waterbirds dataset, where land birds have been intentionally placed in water backgrounds and water birds placed in land backgrounds to study the impact of background spurious correlations on classification.
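<\/p>

<p>A simplified stand-in for such a language-specified loss (not the paper\u2019s exact objective) penalizes high cosine similarity between embeddings of a target class and a named spurious feature, pushing the two apart in representation space. The embeddings below are toy examples:<\/p>

```python
import numpy as np

def spurious_separation_loss(class_emb, spurious_emb, margin=0.5):
    # class_emb, spurious_emb: arrays of shape (batch, dim), e.g. features
    # for 'pacifier' images and for the text or image concept 'baby'.
    # Similarities above (1 - margin) are penalized, encouraging the model
    # to represent the target concept away from the spurious feature.
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = (unit(class_emb) * unit(spurious_emb)).sum(axis=-1)
    return np.maximum(sim - (1.0 - margin), 0.0).mean()

# Identical embeddings (similarity 1.0) incur the maximal penalty, while
# orthogonal embeddings incur none.
e = np.eye(2)
high = spurious_separation_loss(e, e)
low = spurious_separation_loss(e, e[::-1])
```

<p>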
They show that pretrained CLIP models (with a ResNet or transformer core) do rely on such features, but after adding these specifications through contrastive learning, the models improve worst-group accuracy and focus on relevant concepts (see Figure 4).<\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"decomposing-evaluation-for-controllability-and-precision-in-multimodal-generation\">Decomposing evaluation for controllability and precision in multimodal generation&nbsp;<\/h2>\n\n\n\n<p><strong>Observation 3:<\/strong> General-purpose scores used in current benchmarks can\u2019t fully assess the controllability of generative capabilities: how much influence users have in getting the precise output they want. Assessing controllability requires new protocols that decompose evaluation by focusing on fundamental skills\u2014that is, those important across many scenarios.&nbsp;<\/p>\n\n\n\n<p>Since many foundation models, including multimodal ones, have been trained to complete generation tasks, their main evaluation often relies on scores such as Fr\u00e9chet inception distance (FID) or Inception Score. These scores are good indicators of output quality, such as photorealism in the case of generated images or coherence in the case of generated captions. But they don\u2019t reflect how well the generation captures important aspects of the input prompt. Without that information, it\u2019s difficult to determine a model\u2019s controllability. 
Other aspects of a generated image may be just as important as how \u201creal\u201d it looks. Consider spatial understanding. It\u2019s a fundamental subtask for a range of more complex tasks that require careful controllability, including language-guided tasks, object manipulation, navigation, and scene understanding. Misunderstandings here can be more than just frustrating; they can be impediments to productivity or detrimental in safety-critical applications. Drawing from the findings of <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/benchmarking-spatial-relationships-in-text-to-image-generation\/\">&#8220;Benchmarking Spatial Relationships in Text-to-Image Generation,&#8221;<\/a> it&#8217;s evident that current text-to-image models often misinterpret spatial cues. Existing metrics like FID or even object accuracy don&#8217;t seem sensitive enough to flag these spatial errors.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"248\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_6-1024x248.jpg\" alt=\"Figure 6: Examples of images generated with prompts that specify a spatial relationship, but the depicted relationship is not the correct one.\" class=\"wp-image-965199\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_6-1024x248.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_6-300x73.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_6-768x186.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_6-240x58.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_6.jpg 1400w\" 
sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 6: Examples of images generated with prompts that specify a spatial relationship; the depicted relationship does not match the prompt.<\/figcaption><\/figure>\n\n\n\n<p><strong>Strategies for evaluation and model improvement<\/strong>: To account for spatial understanding, the study on benchmarking spatial relationships proposes <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/VISOR\" target=\"_blank\" rel=\"noopener noreferrer\">VerifyIng Spatial Object Relationships, or VISOR,<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> a score that breaks down and conditions the evaluation of these capabilities into two parts:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Does the generated image contain all specified objects?&nbsp;&nbsp;<\/li>\n\n\n\n<li>Does the spatial configuration of the generated objects follow the spatial relationship specified in the prompt?<\/li>\n<\/ol>\n\n\n\n<p>For example, in the study, the model with the best object generation accuracy (DALL-E v2) can generate all specified objects 64 percent of the time, as scored by human annotators. Of those generations, relationships between objects are then accurate (as specified in the prompt) 59 percent of the time. From a user experience perspective, this means the model fully generates what is specified in the prompt less than 40 percent of the time.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Beyond these evaluation results, the work proposes leveraging automated evaluation, which is generally challenging for complex tasks. But by breaking down evaluation into smaller tasks, the study found it was possible to use other forms of machine learning and computer vision for fine-grained automated evaluation. For example, in parallel to the human-annotated scores, the study used an automated version of VISOR. 
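As a side note, the two reported VISOR numbers compose multiplicatively: a generation fully satisfies the prompt only if all objects appear and their relationship is also correct. A quick sketch using the figures above, treated here as illustrative point estimates:

```python
object_accuracy = 0.64      # all specified objects generated (human-scored)
conditional_spatial = 0.59  # correct relationship, given objects present

# Unconditional chance that a single generation fully matches the prompt.
full_success = object_accuracy * conditional_spatial
# about 0.38, i.e., less than 40 percent of the time
```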
This automated version leverages an object detector to evaluate object accuracy and a bounding box localization technique to evaluate spatial relationships. As tasks become more complex, further decomposing the evaluation across microtasks becomes even more important and a promising direction ahead.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"241\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_7-1024x241.jpg\" alt=\"Figure 7: A diagram showing how the VISOR score decomposes evaluation for spatial understanding into object detection and relationship evaluation using a photo of an elephant crossing a street behind a motorcycle as the example.\" class=\"wp-image-965202\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_7-1024x241.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_7-300x71.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_7-768x180.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_7-240x56.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_7.jpg 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 7: VISOR uses metric and task decomposition for evaluation by engaging other machine learning and computer vision techniques to first detect objects and then evaluate their spatial relationships.<\/figcaption><\/figure>\n\n\n\n<p>With a better understanding of model controllability, we can begin to develop methods for improving it. One pivotal aspect of refining multimodal models is their training data. 
For instance, since existing image captions used for training don\u2019t prioritize spatial relationships (often they\u2019re implied or not salient), one could use automated text data augmentation to generate alternative captions that specify spatial relationships (for example, \u201ca truck in front of a motorbike\u201d). Using a similar intuition, researchers behind <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/kosmos-2-grounding-multimodal-large-language-models-to-the-world\/\">&#8220;Kosmos-2: Grounding Multimodal Large Language Models to the World&#8221;<\/a> construct a large-scale dataset of grounded image-text pairs that also contain descriptions of object location. Kosmos-2, a new multimodal model trained on the dataset, exhibits higher accuracy on tasks that directly benefit from better grounding between modalities, such as navigation.&nbsp;<\/p>\n\n\n\n<p>The output of machine learning models, particularly generative ones, varies depending on factors such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/openai\/how-to\/completions\" target=\"_blank\" rel=\"noopener noreferrer\">generation temperature<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/openai\/concepts\/prompt-engineering\" target=\"_blank\" rel=\"noopener noreferrer\">prompt engineering<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and inherent model stochasticity. While these are, of course, concrete practical challenges, the variability they offer can be leveraged to improve experiences and make evaluation more robust. 
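One practical way to exploit this variability is to sample several generations per prompt and surface all of them. Below is a minimal sketch of estimating such "best-of-n" success from per-sample correctness labels; the `best_of_n_success` helper and the toy labels are invented for illustration:

```python
def best_of_n_success(correctness_per_prompt):
    # Fraction of prompts for which at least one of the n sampled
    # generations was judged correct.
    hits = sum(1 for samples in correctness_per_prompt if any(samples))
    return hits / len(correctness_per_prompt)

# Each inner list: correctness labels for the n=4 images sampled
# for one prompt (e.g., from human or automated VISOR judgments).
labels = [
    [False, True, False, False],
    [False, False, False, False],
    [True, True, False, True],
    [False, False, False, True],
]
rate = best_of_n_success(labels)  # 3 of the 4 prompts have a correct sample
```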
For example, while the conditioned VISOR score for DALL-E v2 is 59 percent, when four images are generated, there exists at least one correct generation in that sample 74 percent of the time. When all of them are presented to users (common practice in interfaces), this increases users\u2019 chances of getting a satisfactory generation. Additionally, prompt variability is pervasive in most models where the interaction starts with language. An ablation experiment in the VISOR work shows that generation models tend to depict the object mentioned first in the prompt. Swapping objects in the prompt changes the relationship between them, adding another source of variability. In combination, these insights could be used for more effective interactions.<\/p>\n\n\n\n<p>While plain text and images are major modalities used in text-to-image models, model capabilities are also being evaluated when code is used for generating images and controlling generation. For example, several initial examples in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/sparks-of-artificial-general-intelligence-early-experiments-with-gpt-4\/\">&#8220;Sparks of Artificial General Intelligence: Early experiments with GPT-4&#8221;<\/a> illustrate how prompting an early version of GPT-4, a pure language model, to generate code in TikZ or JavaScript can lead to controllable drawings that depict spatial relationships more accurately. However, since these drawings can be rather simple, the study also seeds a conversation on how to get the best of both worlds: having good controllability through code and higher image quality or complex scenes through image generation. 
For example, it shows how it\u2019s possible to leverage a sketch initiated through code generation via GPT-4 to control the generation of a more complex scene via text-to-image models.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"452\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_8-1024x452.jpg\" alt=\"Figure 8: An example of combining GPT-4 code generated sketches with image generation through Stable Diffusion. The example shows the process of drawing a terrain where there is a river from left to right, a desert with a pyramid below the river, and a city with many high rises above the river.\" class=\"wp-image-965205\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_8-1024x452.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_8-300x132.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_8-768x339.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_8-240x106.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_8.jpg 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 8: An example of combining GPT-4 code generated sketches with image generation through Stable Diffusion.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"beyond-offline-benchmarks-leveraging-adaptation-and-continual-learning-approaches\">Beyond offline benchmarks: Leveraging adaptation and continual learning approaches&nbsp;&nbsp;<\/h2>\n\n\n\n<p><strong>Observation 4: <\/strong>To bridge the gap between what offline measures can capture and the 
capabilities of models in the open world, researchers and developers must embrace adaptation and continual learning approaches, which come with challenges of their own.&nbsp;<\/p>\n\n\n\n<p>While offline evaluation provides a necessary view on how well models perform, it doesn\u2019t account for real-world variables such as the introduction of unseen object categories in the label space, new objects in the long tail of vision representations, user feedback, and differences between training data and data in the open world, including quality differences and differences in perspective and orientation. Such concrete challenges are explored in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/continual-learning-about-objects-in-the-wild-an-interactive-approach\/\" target=\"_blank\" rel=\"noreferrer noopener\">&#8220;Continual Learning about Objects in the Wild: An Interactive Approach<\/a>&#8221; and \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/understanding-personalized-accessibility-through-teachable-ai-designing-and-evaluating-find-my-things-for-people-who-are-blind-or-low-vision\/#:~:text=Teachable%20AI%20systems%20give%20users,benefit%20of%20the%20personalization%20received.\" target=\"_blank\" rel=\"noreferrer noopener\">Understanding Personalized Accessibility through Teachable AI: Designing and Evaluating Find My Things for People who are Blind or Low Vision<\/a>,\u201d two works that offer approaches for enabling the user to extend the capabilities of an AI system to meet their real-world needs. 
<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/taix\/\" target=\"_blank\" rel=\"noreferrer noopener\">Teachable AI systems<\/a>, as these approaches are called, allow users to provide examples or higher-level constraints to shape their experience with AI systems.<\/p>\n\n\n\n<p>The first paper unveils a practical mixed-reality approach for continual learning that includes multimodal models in the loop. In the presented implementation, the system tracks the 3D world position and orientation of objects in an environment via a mixed-reality headset worn by a user, who can provide labels by gazing toward an object highlighted by the system and saying, for example, \u201cThis is my cutting board.\u201d These interactions are designed to adapt a recognition model to improve performance over time on the set of objects encountered by the person using the system.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"382\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_10-1024x382.jpg\" alt=\"Figure 10: An illustration of a mixed-reality system for interactive continual learning in the kitchen domain. Top left: a view through the mixed-reality headset of a cutting board highlighted and labeled by the system and various other kitchen objects. Top right: 3D object detection and localization. 
Bottom: a diverse mix of kitchen objects as seen from an egocentric perspective.\" class=\"wp-image-965211\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_10-1024x382.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_10-300x112.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_10-768x286.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_10-240x89.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/ResponsibleAI_2023Sep_figure_10.jpg 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 9: A mixed-reality system for interactive continual learning using a multimodal model in the loop.<\/figcaption><\/figure>\n\n\n\n<p>Building such a complex system requires answering difficult and unexplored evaluation questions. Some are specific to this application: how well does the system track object locations over time? Which objects should the system highlight and ask the user to label? How well do current state-of-the-art vision models perform in this setting? Others have broader implications for how we effectively evaluate these systems, including how we measure task completion.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Current evaluation methods aren\u2019t sufficient to answer such open questions. The evaluation of teachable AI systems, as described in the second paper, \u201cUnderstanding Personalized Accessibility Through Teachable AI,\u201d comprises a range of challenges beyond offline evaluation. 
\u201cUnderstanding Personalized Accessibility Through Teachable AI\u201d explores these challenges through Find My Things, a research prototype of an application for helping people who are blind or have low vision train an AI system to identify personal items by providing videos of the items, allowing them to later find those items using their phones. Among the conclusions: an AI system needs to help users collect quality teaching examples, work consistently across users, and work on a specific user\u2019s real-world data, not just \u201cclean\u201d data. Meanwhile, users need to understand what actions they can take to improve performance when a system is non-performant.<\/p>\n\n\n\n<p><strong>Strategies for evaluation and model improvement: <\/strong>Analyzing how a system performs when encountering data that isn\u2019t \u201cclean,\u201d specifically the impact of frame quality, \u201cContinual Learning about Objects in the Wild\u201d finds that the CLIP model in a zero-shot setting is at least 10 percent less accurate on images that have some motion blur or occlusion. The result indicates that choosing the right frames for inference may indeed have a positive impact on user experience in zero-shot settings. However, even in the best case, these experiences have a lot of room for improvement on zero-shot recognition: the best model performance is less than 60 percent, even for frames that have been filtered to be without motion blur or occlusion. Similar findings are presented in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/hard-meta-dataset-towards-understanding-few-shot-performance-on-difficult-tasks\/\" target=\"_blank\" rel=\"noreferrer noopener\">&#8220;Hard-Meta-Dataset++: Towards Understanding Few-Shot Performance on Difficult Tasks,&#8221;<\/a> which presents a benchmark that specifically curates tasks that are difficult for the model to get right as a way to encourage model development that improves the worst case\/bottom line. 
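Frame selection of the kind suggested above can be approximated with a simple sharpness heuristic. Below is a hedged sketch using the variance of a Laplacian response as a blur score, a common heuristic rather than the method used in the paper; the threshold and toy frames are invented:

```python
import numpy as np

def blur_score(gray):
    # Variance of the Laplacian response over the frame interior:
    # low variance suggests a blurry frame with few sharp edges.
    lap = (-4 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

def select_frames(frames, threshold):
    # Keep only frames sharp enough to pass on to the recognizer.
    return [f for f in frames if blur_score(f) >= threshold]

# Toy example: a sharp checkerboard vs. a flat (fully blurred) frame.
sharp = np.indices((16, 16)).sum(axis=0) % 2.0
flat = np.full((16, 16), 0.5)
kept = select_frames([sharp, flat], threshold=1.0)  # keeps only the checkerboard
```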
Further, \u201cContinual Learning about Objects in the Wild\u201d experiments with model adaptation by fine-tuning a lightweight model on top of the base model, showing that continuous adaptation techniques hold promise for improving performance in real-world deployments.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Beyond accuracy, it will also be important to reduce the computational costs associated with adapting a model to new data. This is particularly important to realize interactive AI experiences that people can adapt or personalize themselves\u2014for example, teachable object recognizers as proposed by the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/orbit-dataset\/\" target=\"_blank\" rel=\"noreferrer noopener\">ORBIT benchmark<\/a>. This research shows that because of the computational cost and time to personalize a model to an individual\u2019s data, lighter-weight models that are less accurate would be better suited for the ultimate deployed experience than heavier-weight, more accurate ones.<\/p>\n\n\n\n<p>In conclusion, as we&#8217;ve navigated through a plethora of challenges and innovations, one message stands out: the road to effective multimodal AI systems built responsibly demands rigorous evaluation, an understanding of real-world complexities, and a commitment to continual improvement. 
We hope that these recent results will inspire ambitious work forward in the space of reframing the evaluation of multimodal models such that it properly captures their performance from initial evidence to rigorous benchmarks, complex skills, and eventually real-world and human-centered scenarios.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"related-reading\">Related reading<\/h2>\n\n\n\n<p id=\"literature\"><strong>Literature on multimodal models directly discussed in this blog<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/mitigating-spurious-correlations-in-multi-modal-models-during-fine-tuning\/\">Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning<\/a>. Yu Yang, Besmira Nushi, Hamid Palangi, Baharan Mirzasoleiman. ICML 2023.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/social-biases-through-the-text-to-image-generation-lens\/\">Social Biases through the Text-to-Image Generation Lens<\/a>. Ranjita Naik, Besmira Nushi. AIES 2023.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/sparks-of-artificial-general-intelligence-early-experiments-with-gpt-4\/\">Sparks of Artificial General Intelligence: Early Experiments with GPT-4<\/a>. S\u00e9bastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang. Microsoft Research Tech Report 2023.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/kosmos-2-grounding-multimodal-large-language-models-to-the-world\/\">Kosmos-2: Grounding Multimodal Large Language Models to the World<\/a>. Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. 
Microsoft Research Tech Report 2023.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/hard-meta-dataset-towards-understanding-few-shot-performance-on-difficult-tasks\/\">Hard-Meta-Dataset++: Towards Understanding Few-Shot Performance on Difficult Tasks<\/a>. Samyadeep Basu, Megan Stanley, John Bronskill, Soheil Feizi, Daniela Massiceti. ICLR 2023.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/understanding-personalized-accessibility-through-teachable-ai-designing-and-evaluating-find-my-things-for-people-who-are-blind-or-low-vision\/#:~:text=Teachable%20AI%20systems%20give%20users,benefit%20of%20the%20personalization%20received.\">Understanding Personalized Accessibility through Teachable AI: Designing and Evaluating Find My Things for People who are Blind or Low Vision<\/a>. Cecily Morrison, Rita Marques, Martin Grayson,&nbsp; Daniela Massiceti, Camilla Longden, Linda Yilin Wen,&nbsp; Ed Cutrell. ASSETS 2023.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/taxonomizing-and-measuring-representational-harms-a-look-at-image-tagging\/\">Taxonomizing and Measuring Representational Harms: A Look at Image Tagging.<\/a> Jared Katzman, Angelina Wang, Morgan Scheuerman, Su Lin Blodgett, Kristen Laird, Hanna Wallach, Solon Barocas. AAAI 2023.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/a-large-scale-robustness-analysis-of-video-action-recognition-models\/\">A Large-scale Robustness Analysis of Video Action Recognition Models.<\/a> Madeline Chantry Schiappa, Naman Biyani, Prudvi Kamtam, Shruti Vyas, Hamid Palangi, Vibhav Vineet, Yogesh Rawat. 
CVPR 2023.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/benchmarking-spatial-relationships-in-text-to-image-generation\/\" target=\"_blank\" rel=\"noreferrer noopener\">Benchmarking Spatial Relationships in Text-to-Image Generation<\/a>. Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, Yezhou Yang. Microsoft Research Tech Report 2022.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/measuring-representational-harms-in-image-captioning\/\" target=\"_blank\" rel=\"noreferrer noopener\">Measuring Representational Harms in Image Captioning<\/a>. Angelina Wang, Solon Barocas, Kristen Laird, Hanna Wallach. FAccT 2022.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/continual-learning-about-objects-in-the-wild-an-interactive-approach\/\">Continual Learning about Objects in the Wild: An Interactive Approach<\/a>. Dan Bohus, Sean Andrist, Ashley Feniello, Nick Saw, Eric Horvitz. ICMI 2022.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/disability-first-datasets\/\" target=\"_blank\" rel=\"noreferrer noopener\">Disability-first Dataset Creation: Lessons from Constructing a Dataset for Teachable Object Recognition with Blind and Low Vision Data Collectors<\/a>. Lida Theodorou, Daniela Massiceti, Luisa Zintgraf, Simone Stumpf, Cecily Morrison, Ed Cutrell, Matthew Tobias Harris, Katja Hofmann. ASSETS 2021.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/orbit-dataset\/\">ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition<\/a>. Daniela Massiceti, Luisa Zintgraf, John Bronskill, Lida Theodorou, Matthew Tobias Harris, Ed Cutrell, Cecily Morrison, Katja Hofmann, Simone Stumpf. 
ICCV 2021.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/squinting-at-vqa-models-introspecting-vqa-models-with-sub-questions\/\">SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions<\/a>. Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, Ece Kamar. CVPR 2020.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/neuro-symbolic-visual-reasoning-disentangling-visual-from-reasoning-2\/\">Neuro-Symbolic Visual Reasoning: Disentangling \u201cVisual\u201d from \u201cReasoning<\/a>.&#8221; Saeed Amizadeh, Hamid Palangi, Alex Polozov, Yichen Huang, Kazuhito Koishida. ICML 2020.&nbsp;<\/li>\n<\/ul>\n\n\n\n<p><strong>Literature on disaggregated evaluations<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/designing-disaggregated-evaluations-of-ai-systems-choices-considerations-and-tradeoffs\/\" target=\"_blank\" rel=\"noreferrer noopener\">Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs<\/a>. Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, Duncan Wadsworth, Hanna Wallach. AIES 2021.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/towards-accountable-ai-hybrid-human-machine-analyses-for-characterizing-system-failure\/\" target=\"_blank\" rel=\"noreferrer noopener\">Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure<\/a>. Besmira Nushi, Ece Kamar, Eric Horvitz. HCOMP 2018.&nbsp;<\/li>\n\n\n\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/understanding-failures-of-deep-networks-via-robust-feature-extraction\/\" target=\"_blank\" rel=\"noreferrer noopener\">Understanding Failures of Deep Networks via Robust Feature Extraction<\/a>. 
Sahil Singla, Besmira Nushi, Shital Shah, Ece Kamar, Eric Horvitz. CVPR 2021.&nbsp;<\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.youtube.com\/watch?v=NYXRrLzGiFk&t=734s&ab_channel=MicrosoftResearch\" target=\"_blank\" rel=\"noopener noreferrer\">Disaggregated model evaluation and comparison &#8211; YouTube<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>New evaluation methods and a commitment to continual improvement are musts if we\u2019re to build multimodal AI systems that advance human goals. Learn about cutting-edge research into the responsible development and use of multimodal AI at Microsoft.<\/p>\n","protected":false},"author":42735,"featured_media":965712,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13562,13554],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[264846],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-965166","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-research-area-human-computer-interaction","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199560,199561,199565,199571,437514,992148],"msr_impact_theme":["Computing 
foundations"],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144633,144931,283244,372368,606351],"related-projects":[917364,830104,644109,295553,389792],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-960x540.png\" class=\"img-object-cover\" alt=\"Responsible AI blog - hero graphic with connected circles with icons depicting closed captions, calendar, image, and document inside of the circles\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-240x135.png 240w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1-1280x720.png 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/09\/Frontiers-Multimodal-Learning-BlogHeroFeature-1400x788-1.png 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"September 6, 2023","formattedExcerpt":"New evaluation methods and a commitment to continual improvement are musts if we\u2019re to build multimodal AI systems that advance human goals. Learn about cutting-edge research into the responsible development and use of multimodal AI at Microsoft.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/965166","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42735"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=965166"}],"version-history":[{"count":48,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/965166\/revisions"}],"predecessor-version":[{"id":1004514,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/965166\/revisions\/1004514"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/965712"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=965166"}],"wp:term":[{"t
axonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=965166"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=965166"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=965166"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=965166"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=965166"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=965166"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=965166"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=965166"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=965166"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=965166"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}