Research Forum | Episode 2

Research Forum Brief | March 2024

Multimodal Generative AI: The Next Frontier in Precision Health



“GenAI can potentially unlock a slew of high-value applications, from improving patient care to accelerating drug development and clinical discovery, to the ultimate dream of precision health: predicting medical events.”

Hoifung Poon, General Manager, Microsoft Research Health Futures


By Hoifung Poon

The dream of precision health is to prescribe the right intervention for the right patient at the right time. We are still far from attaining this dream, and cancer is the poster child of the challenges we face. Despite all the progress medical science has achieved in treating cancer, the standard of care often fails, with the majority of patients not responding to their prescribed treatment.

The confluence of technological advances and social policies has led to the rapid digitization of multimodal, longitudinal patient journeys, such as electronic medical records (EMRs), imaging, and multiomics (i.e., a type of biological analysis that uses multiple “omes”—the genome, epigenome, microbiome, and so on—as datasets). Each modality conveys only limited information about the patient, like a blind person touching one small part of an elephant and trying to describe the whole animal. By synthesizing all relevant modalities, however, we can create a holistic view of the patient.

The availability of such multimodal real-world data enables pretraining of a powerful patient embedding, which can serve as a digital twin for the patient. In turn, this high-fidelity patient embedding enables patient-like-me reasoning at scale, which can help improve patient care by identifying what works and accelerate discovery by pinpointing exactly where and how today’s medicines don’t work. Such real-world evidence (RWE) represents emergent capabilities that come from assimilating population-scale real-world data and go far beyond the competency of today’s frontier models.
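To make the patient-like-me idea concrete, here is a minimal sketch of what retrieval over pretrained patient embeddings could look like, assuming each patient journey has already been encoded into a fixed-length vector. The embedding dimension, the use of cosine similarity, and the toy data are illustrative assumptions, not our actual pipeline.

```python
# Illustrative sketch only: "patient-like-me" retrieval over pretrained
# patient embeddings, assuming each patient journey has already been
# encoded into a fixed-length vector (dimension and data are made up).
import numpy as np

def cosine_similarity(query: np.ndarray, cohort: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of embeddings."""
    q = query / np.linalg.norm(query)
    c = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
    return c @ q

def patients_like_me(query_embedding: np.ndarray,
                     cohort_embeddings: np.ndarray,
                     k: int = 10) -> np.ndarray:
    """Return indices of the k most similar patients in the cohort."""
    scores = cosine_similarity(query_embedding, cohort_embeddings)
    return np.argsort(-scores)[:k]

# Toy usage: 1,000 hypothetical patients with 768-dimensional embeddings.
rng = np.random.default_rng(0)
cohort = rng.normal(size=(1000, 768))
query = rng.normal(size=768)
print(patients_like_me(query, cohort, k=5))
```

At population scale, an approximate nearest-neighbor index would replace this brute-force scan, but the retrieval idea is the same.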

This is exciting, but progress is difficult. Even for table-stakes medical technologies, such as two-dimensional (2D) X-rays, existing multimodal frontier models show a large competency gap. Meanwhile, three-dimensional (3D) imaging, such as computerized tomography (CT) and magnetic resonance imaging (MRI), is underexplored, and digital pathology images are enormous compared to web images: if we printed a whole digital slide image at standard printer resolution, it would cover a tennis court. At the cutting edge, emerging modalities such as genomics and spatial transcriptomics (i.e., a molecular profiling method that allows researchers to measure all gene activity in a tissue sample and map it to individual cells) are progressing quickly, with rapidly evolving scale and adoption.

Beyond individual modalities, the challenges multiply even further given the combinatorial explosion (i.e., the rapid growth of possibilities or combinations that researchers must consider when solving a problem). This can be likened to a multimodal “Tower of Babel” situation. We no longer have simplistic scaling laws pertaining to one-dimensional model size and training tokens. Instead, we need to factor in both unimodal and cross-modal data points across all modalities and their combinations.

In machine translation, multilingual complexities are often tackled by grounding them in a resource-rich “interlingua” (an intermediary language) such as English. Similarly, in multimodal generative AI (GenAI), text can serve as the 80/20 interlingua modality to drastically simplify learning. Frontier models such as GPT-4 already provide a solid foundation for interpreting biomedical text and assimilating a good portion of public knowledge. Moreover, the study of any modality typically involves natural language, so data is often accompanied by a co-occurring textual description; for example, research literature proves to be a rich source of biomedical multimodal data, such as image-text pairs. At Microsoft Research, we have curated the largest biomedical multimodal dataset from public sources, with 46 million image-text pairs extracted from millions of papers in PubMed Central. Multimodal real-world data, such as medical images and reports, is even more abundant.

To tackle the multimodal complexities in precision health, we propose a modular approach that factors patient embedding into unimodal pretraining and cross-modal learning. For each modality, we can train an encoder and decoder to map it to an embedding and back. Such pretraining can be conducted independently for each modality by leveraging modality-specific self-supervision, such as masked language modeling for text and DINO (self-DIstillation with NO labels) for images. For the text modality, we can piggyback on frontier models or state-of-the-art small language models (SLMs). The encoder and decoder can be the same model, as in GPT-4. For cross-modal learning, we introduce a modality-specific adapter, which serves as a projection layer to “translate” the given modality into the text space. Of course, current text embedding doesn’t capture everything, especially things yet to be discovered (think COVID before 2020). Nevertheless, text still serves as a strong beachhead and can be updated through continued pretraining and fine-tuning.
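As a rough illustration of this modular factorization, the following PyTorch sketch shows a modality-specific adapter that projects a frozen encoder’s output into the text model’s embedding space. The dimensions, the class name, and the choice between a single linear layer and a small MLP are assumptions for illustration, not the actual implementation.

```python
# Illustrative sketch: a lightweight modality-specific adapter that projects
# a frozen encoder's output into the text model's embedding space.
# Dimensions and class names are assumptions, not the actual implementation.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    def __init__(self, modality_dim: int = 1024, text_dim: int = 4096,
                 use_mlp: bool = True):
        super().__init__()
        if use_mlp:
            # Simple two-layer MLP projection.
            self.proj = nn.Sequential(
                nn.Linear(modality_dim, text_dim),
                nn.GELU(),
                nn.Linear(text_dim, text_dim),
            )
        else:
            # Single linear projection layer.
            self.proj = nn.Linear(modality_dim, text_dim)

    def forward(self, modality_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, modality_dim) -> (batch, num_tokens, text_dim)
        return self.proj(modality_embeddings)

# Toy usage: project 256 patch embeddings into the text space so they could
# be prepended to the token embeddings of a language model.
adapter = ModalityAdapter()
image_features = torch.randn(2, 256, 1024)   # random stand-in for frozen encoder output
text_space_tokens = adapter(image_features)
print(text_space_tokens.shape)               # torch.Size([2, 256, 4096])
```

Only the adapter’s parameters need to be trained in this setup, which is what makes cross-modal learning so data efficient.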

LLaVA-Med shows how this general recipe might work in practice, using image-text pairs as an example. It adopts a modular design, where the vision encoder and text decoder can be plugged in from any pretrained models. The hypothesis is that unimodal pretraining already removes a large number of superficial variations, so learning can be very data efficient and focus on the lightweight adapter, such as a linear layer or a simple multilayer perceptron (MLP). Another key idea behind LLaVA-Med is to leverage frontier models (specifically GPT-4) to synthesize multimodal instruction-following data. Given an image-text pair, we take the gold text and ask GPT-4 to generate simulated conversations about the image, using only information from the text. Then, for each generated question-answer pair, we add back the image to form an (image, question, answer) triad for multimodal instruction-tuning. In this way, GPT-4 can generate a huge amount of multimodal instruction-following data from the original image-text pairs.
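The data-synthesis step can be sketched as follows. The prompt wording, the JSON output format, and the call_frontier_model stub are hypothetical stand-ins for however GPT-4 is actually invoked; the stub returns a canned response here only so the sketch runs end to end.

```python
# Illustrative sketch: turn one (image, gold text) pair into
# (image, question, answer) triads for multimodal instruction-tuning.
import json
from typing import Dict, List

PROMPT_TEMPLATE = (
    "Here is the caption/report of a biomedical image:\n\n{gold_text}\n\n"
    "Write {n} question-answer pairs a user might ask about the image, using "
    'ONLY information in the text. Return JSON: [{{"question": "...", "answer": "..."}}]'
)

def call_frontier_model(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4-style API call; returns a canned
    response so this sketch is runnable without any API client."""
    return '[{"question": "What modality is shown?", "answer": "A frontal chest X-ray."}]'

def synthesize_instruction_data(image_path: str, gold_text: str,
                                n_pairs: int = 3) -> List[Dict]:
    """Generate question-answer pairs from the gold text, then re-attach the image."""
    raw = call_frontier_model(PROMPT_TEMPLATE.format(gold_text=gold_text, n=n_pairs))
    qa_pairs = json.loads(raw)  # assumes the model returned valid JSON
    return [{"image": image_path,
             "question": qa["question"],
             "answer": qa["answer"]} for qa in qa_pairs]

# Toy usage with a hypothetical image path and report text.
print(synthesize_instruction_data("chest_xray_001.png",
                                  "Frontal chest radiograph shows no acute findings."))
```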

We have applied the LLaVA-Med recipe to multimodal patient data, such as radiology image-report pairs, demonstrating substantial improvement over existing frontier models on standard tasks such as identifying key findings from radiology images. The same recipe can also be applied to in-silico imaging by adding an image decoder, as shown in BiomedJourney. Specifically, BiomedJourney takes consecutive radiology image-report pairs from a patient journey, uses GPT-4 to summarize the changes, and then leverages the resulting (before image, progression text, after image) triad for multimodal instruction-tuning. Given a prior image and a hypothetical progression, BiomedJourney can generate a counterfactual image reflecting the changes.
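A rough sketch of assembling such training triads from a single patient journey is shown below. The summarize_progression stub is a hypothetical stand-in for the GPT-4 call that summarizes changes between consecutive reports, and the dictionary fields are illustrative rather than the actual data schema.

```python
# Illustrative sketch: build (before image, progression text, after image)
# triads from time-ordered radiology image-report pairs in one patient journey.
from typing import Dict, List

def summarize_progression(report_before: str, report_after: str) -> str:
    """Hypothetical stand-in for a GPT-4 call; canned output so the sketch runs."""
    return "Interval improvement: previously noted opacity has largely resolved."

def build_progression_triads(journey: List[Dict]) -> List[Dict]:
    """journey: time-ordered list of {"image": path, "report": text} studies."""
    triads = []
    for earlier, later in zip(journey, journey[1:]):
        progression_text = summarize_progression(earlier["report"], later["report"])
        triads.append({
            "before_image": earlier["image"],        # conditioning image
            "progression_text": progression_text,    # conditioning text
            "after_image": later["image"],           # generation target
        })
    return triads

# Toy usage with two hypothetical studies from the same patient.
journey = [
    {"image": "study_2021_01.png", "report": "Right lower lobe opacity."},
    {"image": "study_2021_03.png", "report": "Opacity largely resolved."},
]
print(build_progression_triads(journey))
```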

For digital pathology, the enormous slide size translates into a context of up to a million tokens, which would blow up self-attention in a transformer model. We have explored advanced techniques such as dilated attention to circumvent this limitation. In joint work with Providence researchers, we have trained real-world pathology foundation models from hundreds of thousands of slides along with their clinical reports, with promising results in pathomics and progression modeling.
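As a rough sketch of the dilated-attention idea (not our actual implementation), the snippet below computes single-head attention over one (segment length, dilation) configuration: each segment attends only among every r-th position, so the per-segment cost drops from segment_len² to (segment_len / r)². The segment length, dilation rate, and single-head, projection-free setup are simplifying assumptions; a full implementation would mix several configurations so that every position is covered.

```python
# Illustrative, simplified single-head dilated attention for one
# (segment_len, dilation) configuration. Assumes seq_len is divisible by
# segment_len and segment_len by dilation; no learned projections.
import torch

def dilated_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      segment_len: int = 2048, dilation: int = 4) -> torch.Tensor:
    """q, k, v: (batch, seq_len, dim)."""
    b, n, d = q.shape
    out = torch.zeros_like(q)
    for start in range(0, n, segment_len):
        # Select every `dilation`-th position within this segment.
        idx = torch.arange(start, start + segment_len, dilation)
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]
        # Dense attention only among the selected positions.
        attn = torch.softmax(qs @ ks.transpose(-2, -1) / d ** 0.5, dim=-1)
        out[:, idx] = attn @ vs
    return out

# Toy usage: an 8,192-token sequence. Positions skipped by this dilation
# pattern remain zero here; other (segment_len, dilation) configurations
# would cover them in a complete implementation.
x = torch.randn(1, 8192, 64)
print(dilated_attention(x, x, x).shape)  # torch.Size([1, 8192, 64])
```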

By learning multimodal and longitudinal patient embedding from population-level real-world data, multimodal GenAI can potentially unlock a slew of high-value applications, from improving patient care to accelerating drug development and clinical discovery, to the ultimate dream of precision health: predicting next medical events, such as longitudinal disease progression and treatment outcome, as in real-world evidence (RWE).

The multimodal GenAI research work described in this essay stems from collaboration across Microsoft Research, Azure AI, and HLS S&P (Nuance), and includes key collaborators such as Jianfeng Gao, Mu Wei, Matt Lungren, Houdong Hu, Hany Awadalla, Furu Wei, Tao Qin, and team.