{"id":946080,"date":"2023-06-13T11:00:00","date_gmt":"2023-06-13T18:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=946080"},"modified":"2023-06-13T15:48:24","modified_gmt":"2023-06-13T22:48:24","slug":"accounting-for-past-imaging-studies-enhancing-radiology-ai-and-reporting","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/accounting-for-past-imaging-studies-enhancing-radiology-ai-and-reporting\/","title":{"rendered":"Accounting for past imaging studies: Enhancing radiology AI and reporting"},"content":{"rendered":"\n<p>The use of self-supervision from image-text pairs has been a key enabler in the development of scalable and flexible vision-language AI models, not only in general domains but also in biomedical domains such as radiology. The goal in the radiology setting is to produce rich training signals without requiring manual labels so the models can learn to accurately recognize and locate findings in the images and relate them to content in radiology reports.<\/p>\n\n\n\n<p>Radiologists use radiology reports to describe imaging findings and offer a clinical diagnosis or a range of possible diagnoses, all of which can be influenced by considering the findings on previous imaging studies. In fact, comparisons with previous images are crucial for radiologists to make informed decisions. These comparisons provide valuable context for determining whether a condition is a new concern or, if preexisting, whether it is improving, deteriorating, or stable, and they can inform more appropriate treatment recommendations. Despite the importance of comparisons, current AI solutions for radiology often fall short in aligning images with report data because they lack access to prior scans. They also typically fail to account for the chronological progression of disease or imaging findings often present in biomedical datasets. 
This can lead to ambiguity in the model training process and can be risky in downstream applications such as automated report generation, where models may make up temporal content without access to past medical scans. In short, this limits the real-world applicability of such AI models to empower caregivers and augment existing workflows.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--right\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/learning-to-exploit-temporal-structure-for-biomedical-vision-language-processing\/\" data-bi-cN=\"Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>In <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/making-the-most-of-text-semantics-to-improve-biomedical-vision-language-processing\/\">our previous work<\/a>, we demonstrated that multimodal self-supervised learning of radiology images and reports can yield significant performance improvement in downstream applications of machine learning models, such as detecting the presence of medical conditions and localizing these findings within the images. 
In our latest study, which is being presented at the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/cvpr2023.thecvf.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">2023 IEEE\/CVF Computer Vision and Pattern Recognition Conference (CVPR)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we propose <em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/learning-to-exploit-temporal-structure-for-biomedical-vision-language-processing\/\">BioViL-T<\/a><\/em>, a self-supervised training framework that further increases the data efficiency of this learning paradigm by leveraging the temporal structure present in biomedical datasets. This approach enables the incorporation of temporal information and has the potential to perform complementary self-supervision without the need for additional data, resulting in improved predictive performance.<\/p>\n\n\n\n<p>Our proposed approach can handle missing or spatially misaligned images and can potentially scale to process a large number of prior images. By leveraging the existing temporal structure available in datasets, BioViL-T<em> <\/em>achieves state-of-the-art results on several downstream benchmarks. We&#8217;ve made both <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aka.ms\/biovil-t-model\" target=\"_blank\" rel=\"noopener noreferrer\">our models<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aka.ms\/biovil-t-code\" target=\"_blank\" rel=\"noopener noreferrer\">source code<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> open source, allowing for a comprehensive exploration and validation of the results discussed in our study. 
We\u2019ve also released a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/physionet.org\/content\/ms-cxr-t\/1.0.0\/\" target=\"_blank\" rel=\"noopener noreferrer\">new multimodal temporal benchmark dataset, MS-CXR-T<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, to support further research into longitudinal modeling of medical images and text data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"connecting-the-data-points\">Connecting the data points<\/h2>\n\n\n\n<p>Solving for the static case in vision-language processing\u2014that is, learning with pairs of <em>single<\/em> images and captions\u2014is a natural first step in advancing the field. So it\u2019s not surprising that current biomedical vision-language processing work has largely focused on tasks that are dependent on features or abnormalities present at a single point in time\u2014what is a patient\u2019s current condition, and what is a likely diagnosis?\u2014treating image-text pairs such as x-rays and corresponding reports in today\u2019s datasets as independent data points. When prior imaging findings are referenced in reports, that information is often ignored or removed in the training process. Moreover, a lack of publicly available datasets containing longitudinal series of imaging examinations and reports has further challenged the incorporation of temporal information into medical imaging benchmarks.<\/p>\n\n\n\n<p>Thanks to our early and close collaboration with practicing radiologists and our long-standing work with Nuance, a leading provider of AI solutions in the radiology space that was <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/news.microsoft.com\/2022\/03\/04\/microsoft-completes-acquisition-of-nuance-ushering-in-new-era-of-outcomes-based-ai\/\" target=\"_blank\" rel=\"noopener noreferrer\">acquired by Microsoft in 2022<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we\u2019ve been able to better understand clinician workflow in the radiological imaging setting. 
That includes how radiology data is created, what its different components are, and how routinely radiologists refer to prior studies in the context of interpreting medical images. With these insights, we were able to identify temporal alignment of text across multiple images as a clinically significant research problem. Grounding, or associating, report information such as \u201cpleural effusion has improved compared to previous study\u201d with the imaging requires access to the prior imaging study. We were able to tackle this challenge <em>without <\/em>gathering additional data or annotations.<\/p>\n\n\n\n<p>As an innovative solution, we leveraged the metadata from de-identified public datasets like <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/physionet.org\/content\/mimic-cxr\/2.0.0\/\" target=\"_blank\" rel=\"noopener noreferrer\">MIMIC-CXR<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. This metadata preserves the original order and intervals of studies, allowing us to connect various images over time and observe disease progression. Developing smarter, more data-efficient methods is important in the healthcare space, where data sources are scarce, if we want to build meaningful AI solutions.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1104\" height=\"643\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance_BioViLt_diagram_cropped.gif\" alt=\"An animated flowchart of BioViL-T. Arrows direct from a prior chest x-ray and current chest x-ray  through boxes labeled \u201cCNN\u201d to image embeddings, illustrated by a purple cube and a brown cube, respectively, representing relevant spatial and temporal features. 
An arrow points from these features through a box labeled \u201cVision Transformer Blocks\u201d to a \u201cdifference embedding,\u201d represented by a blue cube. A curly bracket pointing to a brown and blue cube labeled \u201cimage features\u201d indicates the aggregation of the current image embedding and the difference embedding. Arrows from the \u201cimage features\u201d cube and from an extract from a radiology report point to a text model, represented by box labeled \u201cCXR-BERT.\u201d \" class=\"wp-image-946818\"\/><figcaption class=\"wp-element-caption\">Figure 1: The proposed self-supervised training framework BioViL-T leverages pairs of radiology reports and sequences of medical images. The training scheme does not require manual expert labels and can scale to a large amount of radiology data to pretrain image and text models required for downstream clinical applications.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"addressing-the-challenges-of-longitudinal-analysis\">Addressing the challenges of longitudinal analysis<\/h2>\n\n\n\n<p>With current and prior images now available for comparison, the question became, <em>how can a model reason about images coming from different time points<\/em>? Radiological imaging, especially with planar techniques like radiographs, may show noticeable variation. This can be influenced by factors such as the patient\u2019s posture during capture and the positioning of the device. Notably, these variations become more pronounced when images are taken with longer time gaps in between. To manage variations, current approaches to longitudinal analysis, largely used for fully supervised learning of image models only, require extensive preprocessing, such as image registration, a technique that attempts to align multiple images taken at different times from different viewpoints. 
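The study-linking step described earlier relies only on dataset metadata: de-identified patient identifiers and acquisition timestamps, as provided by datasets like MIMIC-CXR. A minimal, illustrative sketch of that pairing step is below; the record layout and field names here are hypothetical stand-ins, not the actual dataset schema or BioViL-T code.

```python
from datetime import datetime

# Hypothetical study records; real metadata (e.g., in MIMIC-CXR) supplies
# de-identified patient IDs and acquisition timestamps that preserve ordering.
studies = [
    {"subject": "p01", "study": "s3", "acquired": datetime(2023, 1, 20)},
    {"subject": "p01", "study": "s1", "acquired": datetime(2023, 1, 5)},
    {"subject": "p02", "study": "s7", "acquired": datetime(2023, 2, 2)},
]

def pair_with_prior(studies):
    """Order each patient's studies chronologically and attach the
    immediately preceding study to each one (None when no prior exists)."""
    pairs = []
    latest = {}  # subject -> most recent study seen so far
    for s in sorted(studies, key=lambda s: (s["subject"], s["acquired"])):
        pairs.append((s, latest.get(s["subject"])))
        latest[s["subject"]] = s
    return pairs

for current, prior in pair_with_prior(studies):
    print(current["study"], "prior:", prior["study"] if prior else None)
```

Note that the first study of each patient has no prior; a temporal model has to handle that missing-prior case rather than discard the sample.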
In addition to better managing image variation, we wanted a framework that could be applied to cases in which prior images weren\u2019t relevant or available and the task involved only one image.<\/p>\n\n\n\n<p>We designed BioViL-T with these challenges in mind. Its main components are a multi-image encoder, consisting of both a vision transformer and a convolutional neural network (CNN), and a text encoder. As illustrated in Figure 1, in the multi-image encoder, each input image is first encoded with the CNN model to independently extract findings, such as opacities, present in each medical scan. Here, the CNN counteracts the large data demands of transformer-based architectures through its efficiency in extracting lower-level semantic features.<\/p>\n\n\n\n<p>At the next stage, the features across time points are matched and compared in the vision transformer block, then aggregated into a single joint representation incorporating both current and historical radiological information. It\u2019s important to note that the transformer architecture can adapt to either single- or multi-image scenarios, thereby better handling situations in which past images are unavailable, such as when there\u2019s no relevant image history. Additionally, a cross-attention mechanism across image regions reduces the need for extensive preprocessing, addressing potential variations across images. <br><br>In the final stage, the multi-image encoder is jointly trained with the text encoder to match the image representations with their text counterparts using masked modeling and contrastive supervision techniques. 
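As a rough illustration of the fusion step just described, the sketch below attends from each current-image patch feature over the prior image's patch features and keeps both the static feature and a temporal "difference" component, degrading gracefully to zeros when no prior exists. This is a toy sketch of the general cross-attention idea, not the actual BioViL-T implementation.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # weighted average of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

def fuse(current_patches, prior_patches=None):
    """Concatenate each current patch feature with a temporal 'difference'
    component; with no prior image, the temporal part is simply zero."""
    fused = []
    for patch in current_patches:
        if prior_patches is None:
            diff = [0.0] * len(patch)  # single-image case: no progression signal
        else:
            prior_view = attend(patch, prior_patches, prior_patches)
            diff = [c - p for c, p in zip(patch, prior_view)]
        fused.append(patch + diff)  # static features + temporal features
    return fused
```

Because attention operates over all patch positions of the prior image, spatial correspondence is learned rather than enforced by preprocessing, which is the property that reduces the need for explicit image registration.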
To improve text representations and model supervision, we utilize the domain-specific text encoder <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/microsoft\/BiomedVLP-CXR-BERT-general\" target=\"_blank\" rel=\"noopener noreferrer\">CXR-BERT-general<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which is pretrained on clinical text corpora and built on a clinical vocabulary.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"597\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance_rollout_attention_gif.gif\" alt=\"Two chest x-rays side-by-side animated with bounding boxes and attention maps on the affected area of the lung.  \" class=\"wp-image-946827\"\/><figcaption class=\"wp-element-caption\">Figure 2: Example of current (left) and prior (right) chest x-ray scans. The attention maps computed within the vision transformer show (in purple) how the model interprets disease progression by focusing on these image regions. In this example, the airspace disease seen in the left lung lobe has improved since the prior acquisition.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"grounded-model-prediction\">Grounded model prediction<\/h2>\n\n\n\n<p>In our work, we found that linking multiple images during pretraining makes for both better language and vision representations, enabling the AI model to better associate information present in both the text and the images. This means that when given a radiology report of a chest x-ray, for example, with the description \u201cincreased opacities in the left lower lung compared with prior examination,\u201d a model can more accurately identify, locate, and compare findings, such as opacities. 
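The contrastive part of the joint training objective mentioned above can be sketched as a standard symmetric InfoNCE-style loss over paired image and text embeddings. This is an illustrative, dependency-free sketch of the generic technique, not the exact BioViL-T objective or its hyperparameters.

```python
import math

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss: matched image/report pairs (same index)
    should score higher than mismatched pairs in both directions."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # cosine-similarity logits, scaled by temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    def xent(rows):
        # cross-entropy with the diagonal (true pair) as the target
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            loss += log_z - row[i]
        return loss / len(rows)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (xent(logits) + xent(transposed))
```

Correctly matched batches should yield a lower loss than shuffled ones, which is what drives the image and text encoders toward a shared representation space.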
This improved alignment between data modalities is crucial because it allows the model to provide more accurate and relevant insights, such as identifying abnormalities in medical images, generating more accurate diagnostic reports, or tracking the progression of a disease over time.<\/p>\n\n\n\n<p>Two findings were particularly insightful for us during our experimentation with BioViL-T:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Today\u2019s language-generating AI models are often trained by <em>masking<\/em> portions of text and then prompting them to fill in the blanks as a means of encouraging the models to account for context in outputting a prediction. We extended the traditional masked language modeling (MLM) approach to be guided by multi-image context, essentially making the approach multimodal. This, in turn, helped us better analyze whether BioViL-T was learning a progression based on provided images or making a random prediction of the masked words based solely on the text context. We gave the model radiology images and reports with progression-related language, such as \u201cimproving,\u201d masked. An example input would be \u201cpleural effusion has been [MASKED] since yesterday.\u201d We then tasked the model with predicting the missing word(s) based on single and multi-image inputs. When provided with a single image, the model was unsuccessful in completing the task; however, when provided with a current and prior image, performance improved, demonstrating that the model bases its predictions on the prior image.<\/li>\n\n\n\n<li>Additionally, we found that training on prior images decreases instances of the generative AI model producing ungrounded outputs that seem plausible but are factually incorrect, in this case, when there\u2019s a lack of information. 
Prior work on radiology report generation uses single input images, so a model may output text that describes progression without ever having access to past scans. This severely limits the potential adoption of AI solutions in a high-stakes domain such as healthcare. A decrease in ungrounded outputs, however, could enable automated report generation or assistive writing in the future, which could potentially <a href=\"https:\/\/www.microsoft.com\/en-us\/industry\/blog\/healthcare\/2022\/03\/04\/microsoft-and-nuance-supporting-the-resilience-of-healthcare\/\" target=\"_blank\" rel=\"noreferrer noopener\">help reduce administrative duties and ease burnout in the healthcare community<\/a>. Note that these models aren\u2019t intended for any clinical use at the moment, but they\u2019re important proof points to assess the capabilities of healthcare AI.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"moving-longitudinal-analysis-forward\">Moving longitudinal analysis forward<\/h2>\n\n\n\n<p>Through our relationships with practicing radiologists and Nuance, we were able to identify and concentrate on a clinically important research problem, finding that accounting for patient history matters if we want to develop AI solutions with value. To help the research community advance longitudinal analysis, we\u2019ve released a new benchmark dataset. 
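The masked-prediction probe described in the first bullet above can be set up with a small utility like the following. This is an illustrative sketch: the progression vocabulary here is a hypothetical example list, not the exact terms used in the study.

```python
import re

# Hypothetical progression vocabulary; the actual probe targets
# temporal-change language found in radiology reports.
PROGRESSION_TERMS = ["improved", "improving", "worsened", "stable",
                     "unchanged", "increased", "decreased"]

def mask_progression(sentence, terms=PROGRESSION_TERMS):
    """Replace progression words with [MASKED] and return the masked
    sentence plus the ground-truth answers, so a model's fill-in
    predictions can be scored with vs. without the prior image."""
    answers = []
    def repl(match):
        answers.append(match.group(0).lower())
        return "[MASKED]"
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(repl, sentence), answers

masked, answers = mask_progression("Pleural effusion has improved since yesterday.")
# masked -> "Pleural effusion has [MASKED] since yesterday."
```

Scoring a model's predictions for the masked slot with single-image versus current-plus-prior inputs is what reveals whether its temporal language is grounded in the prior scan.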
<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/physionet.org\/content\/ms-cxr-t\/1.0.0\/\" target=\"_blank\" rel=\"noopener noreferrer\">MS-CXR-T<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which was curated by a board-certified radiologist, consists of two parts: for the temporal image classification task, current-prior pairs of chest x-rays labeled with a state of progression; and for the sentence similarity task, pairs of sentences about disease progression that either contradict each other or capture the same assessment in different phrasing.<\/p>\n\n\n\n<p>We focused on chest x-rays and lung diseases, but we see the potential for our work to extend to other medical imaging settings where analyzing images over time plays an important part in clinician decision-making, such as scenarios involving MRI or CT scans. However far the reach, it\u2019s vital to ensure that models such as BioViL-T generalize well across different population groups and under the various conditions in which medical images are captured. This important part of the journey requires extensive benchmarking of models on unseen datasets. These datasets should vary widely in terms of acquisition settings, patient demographics, and disease prevalence. 
Another aspect of this work we look forward to exploring and monitoring is the potential role of general foundation models like GPT-4 in domain-specific foundation model training and the benefits of pairing larger foundation models with smaller specialized models such as BioViL-T.<\/p>\n\n\n\n<p>To learn more and to access our text and image models and source code, visit the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/microsoft\/BiomedVLP-BioViL-T\" target=\"_blank\" rel=\"noopener noreferrer\">BioViL-T Hugging Face page<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/hi-ml\/tree\/main\/hi-ml-multimodal\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/huggingface.co\/microsoft\/BiomedVLP-BioViL-T\" target=\"_blank\" rel=\"noreferrer noopener\">BioViL-T models<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-fill-github\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/github.com\/microsoft\/hi-ml\/tree\/main\/hi-ml-multimodal\" target=\"_blank\" rel=\"noreferrer noopener\">BioViL-T code<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"acknowledgments\">Acknowledgments<\/h2>\n\n\n\n<p>We\u2019d like to thank our co-authors: <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/shbannur\/\">Shruthi Bannur<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/sthyland\/\">Stephanie Hyland<\/a>, Qianchu Liu<a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/fperezgarcia\/\">, Fernando P\u00e9rez-Garc\u00eda<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/maxilse\/\">Maximilian Ilse<\/a>, Daniel C. Castro, Benedikt Boecking, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/harssharma\/\">Harshita Sharma<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/t-kbouzid\/\">Kenza Bouzid<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/anthie\/\">Anja Thieme<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/antonsc\/\">Anton Schwaighofer<\/a>, Maria Wetscherek, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/adityan\/\">Aditya Nori<\/a>. We\u2019d also like to thank <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/hoifung\/\">Hoifung Poon<\/a>, Melanie Bernhardt, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/mebristo\/\">Melissa Bristow<\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/naotous\/\">Naoto Usuyama<\/a> for their valuable technical feedback and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/hamurfet\/\">Hannah Richardson<\/a> for assisting with compliance reviews.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"medical-device-disclaimer\">MEDICAL DEVICE DISCLAIMER<\/h2>\n\n\n\n<p>BioViL-T was developed for research purposes and is not designed, intended, or made available as a medical device and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The use of self-supervision from image-text pairs has been a key enabler in the development of scalable and flexible vision-language AI models in not only general domains but also in biomedical domains such as radiology. 
The goal in the radiology setting is to produce rich training signals without requiring manual labels so the models can [&hellip;]<\/p>\n","protected":false},"author":42735,"featured_media":948672,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Ozan Oktay","user_id":"38706"},{"type":"user_nicename","value":"Javier Alvarez-Valle","user_id":"32137"},{"type":"user_nicename","value":"Matthew Lungren","user_id":"42792"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13553],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[264846,261673],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-946080","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-medical-health-genomics","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[849856],"msr_impact_theme":["Computing foundations","Health"],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[780706],"related-projects":[978063,855669],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"BioViL-T sequence diagram\" decoding=\"async\" loading=\"lazy\" 
srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/06\/Nuance-BlogHeroFeature-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Ozan Oktay, Javier Alvarez-Valle, and Matthew Lungren","formattedDate":"June 13, 2023","formattedExcerpt":"The use of self-supervision from image-text pairs has been a key enabler in the development of scalable and flexible vision-language AI models in not only general domains but also in biomedical domains such as radiology. 
The goal in the radiology setting is to produce rich&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/946080","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42735"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=946080"}],"version-history":[{"count":28,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/946080\/revisions"}],"predecessor-version":[{"id":948648,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/946080\/revisions\/948648"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/948672"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=946080"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=946080"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=946080"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=946080"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=946080"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=946080"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.m
icrosoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=946080"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=946080"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=946080"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=946080"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=946080"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}