Sensitive information such as healthcare data is often siloed within the boundaries of individual health organizations. This poses a challenge for the health and life sciences industry, whose machine learning models need that data for training. To improve patient care and accelerate progress in the health industry, the Microsoft Health & Life Sciences AI group used a federated learning setup to train their biomedical natural language processing service, Text Analytics for Health, while preserving the trust boundaries of siloed data. The federated learning framework was built using Microsoft Azure Machine Learning and open-source technologies to help organizations analyze siloed data and build new applications without compromising data privacy.
Drawing connections from documents stored in silos
Consider how data is typically captured in the healthcare industry. Doctors scribble or dictate notes that are later transcribed, or they may type them into a computer directly during a checkup or consultation. Traditionally, these notes then sit in digital databases as unstructured text, untapped until the next appointment.
But what if AI and cloud technology could help organizations analyze that data, interpreting what it means? What if they could then draw connections between unstructured patient data and healthcare records? This could enable use cases such as trend analysis (creating predictions based on broader data) and even scalable matching to clinical trials (quickly identifying eligibility for cutting-edge treatments).
That’s exactly what the Microsoft Health & Life Sciences AI group envisioned and delivered in the form of Text Analytics for Health. Partnering with Microsoft Research (MSR) and the Microsoft AIMS (Azure AI Platform Incubations, Machine Learning and Data Science) team for federated learning tools and the Azure Machine Learning team to incorporate Microsoft Azure Cognitive Services, the team spent two years developing and refining a natural language processing approach to analyzing medical texts.
“We’ve seen a need for a service that helps customers analyze and derive insights from unstructured clinical data such as doctors’ notes, radiology narratives, and any text in the electronic medical records,” says Hadas Bitran, Partner Group Manager from the Health & Life Sciences AI group at Microsoft. “We built Text Analytics for Health to empower healthcare organizations to leverage their clinical data, to provide better care to patients. The Azure Machine Learning–based federated learning approach we used was critical for training our models on real-world documents specific to the Health & Life Sciences industry.”
What’s under the hood
The Health & Life Sciences AI group built Text Analytics for Health using PyTorch, an open-source machine learning framework, because it’s ideally suited to natural language processing.
Text Analytics for Health provides support for:
- Named entity recognition: detects words and phrases mentioned in unstructured text that can be associated with one or more semantic types, such as diagnosis, medication name, symptom/sign, or age.
- Entity linking: disambiguates distinct entities by associating named entities mentioned in text to concepts found in a predefined database of concepts, including the Unified Medical Language System (UMLS). Medical concepts are also assigned preferred naming, as an additional form of normalization.
- Relation extraction: identifies meaningful connections between concepts mentioned in text. Examples include finding a “time of condition” relation by associating a condition name with a time, connecting an abbreviation with its full description, connecting an examination with its results, or connecting a prescription drug with its frequency and amount.
- Assertion detection: identifies modifiers that change the meaning of a mentioned concept, such as negation (for example, a symptom the patient reports no longer having).
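To make the four kinds of output listed above more concrete, here is a minimal sketch of how the results for a short note might be represented. The `Entity` and `Relation` classes, their field names, the category and relation labels, and the sample note are all hypothetical illustrations and are not the service's actual response schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    text: str                          # phrase found in the note
    category: str                      # semantic type (named entity recognition)
    concept_id: Optional[str] = None   # linked concept ID (entity linking); left unset rather than guessed
    negated: bool = False              # assertion: the note says the finding is absent


@dataclass
class Relation:
    relation_type: str                 # kind of connection (relation extraction)
    source: Entity
    target: Entity


# Hypothetical note and hand-written annotations, for illustration only.
note = "Patient denies chest pain. Prescribed ibuprofen 400 mg twice daily."

chest_pain = Entity("chest pain", "SymptomOrSign", negated=True)
drug = Entity("ibuprofen", "MedicationName")
dose = Entity("400 mg", "Dosage")
frequency = Entity("twice daily", "Frequency")
entities = [chest_pain, drug, dose, frequency]

relations = [
    Relation("DosageOfMedication", dose, drug),
    Relation("FrequencyOfMedication", frequency, drug),
]


def affirmed(items: List[Entity]) -> List[str]:
    """Return the text of every entity that is not negated."""
    return [e.text for e in items if not e.negated]


def details_for(target: Entity, rels: List[Relation]) -> List[str]:
    """Return the text of every entity related to the given target."""
    return [r.source.text for r in rels if r.target is target]
```

Here `affirmed(entities)` drops the negated symptom, and `details_for(drug, relations)` gathers the dosage and frequency tied to the medication, which is the kind of connection a downstream application would rely on.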
It’s important to recognize that private patient data and proprietary corporate data never leave the customer’s internal networks.
The Health & Life Sciences AI group used Azure Machine Learning to train the service in a federated learning setup and deploy it to production, following the Federated Learning cookbook developed by the Microsoft AIMS team. The setup was developed with the help of MSR’s FLUTE, which simulated federated learning experiments during the prototype phase and inspired the development of the Federated Learning cookbook for production.
Typically for this type of training, data would be centralized into a single environment. In the federated model, processing occurs only on the client side, and only aggregated, untraceable data is delivered back to the central server.
To train the model, multiple data partnerships were created, and a federated learning system was implemented across each partner’s environment, enabling siloed training on the partner’s data. Prior to training, all data was de-identified by the partner. This enabled use of a wide range of training examples representing several languages, formats, medical fields, and expertise. The result is a service that understands the core entities so well that it often can infer the role of new terms, even when misspelled, from context.
“Within the data silo, the local model learns on its own data and updates to the central model, which aggregates differences across the various silos and creates a unified model while preserving trust boundaries,” notes Tal Baumel, Senior Applied Data Scientist from the Health & Life Sciences AI group at Microsoft.
“We’ve built a solution on federated learning that trains a machine learning model in a distributed manner without the need for data to leave its environment—so the data never left the partner’s trust boundaries,” says Baumel.
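The loop Baumel describes, in which each silo trains locally and the central model aggregates the differences, can be sketched with a toy example. The article does not name the aggregation algorithm the team used, so this sketch assumes simple sample-weighted averaging of weight deltas (in the style of federated averaging) on a one-parameter linear model. Every function name and the synthetic data are hypothetical.

```python
import random


def local_update(global_w, data, lr=0.5, epochs=5):
    """Train a copy of the global weight on one silo's data.

    Only the weight delta is returned; the raw (x, y) records stay in the silo.
    """
    w = global_w
    for _ in range(epochs):
        # Gradient of mean squared error for the model y ~ w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w - global_w


def federated_round(global_w, silos):
    """Combine per-silo deltas, weighted by each silo's sample count."""
    total = sum(len(data) for data in silos)
    delta = sum(local_update(global_w, data) * len(data) for data in silos) / total
    return global_w + delta


def make_silo(n, slope, seed):
    """Generate synthetic data for one silo (stand-in for private partner data)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, slope * x + rng.gauss(0, 0.01)) for x in xs]


# Three silos of different sizes that share the same underlying relationship.
silos = [make_silo(50, 3.0, 1), make_silo(80, 3.0, 2), make_silo(30, 3.0, 3)]

global_w = 0.0
for _ in range(20):
    global_w = federated_round(global_w, silos)
```

After the rounds complete, `global_w` sits close to the true slope of 3.0 even though the central loop never reads a single record, only the deltas each silo reports. A production system would add safeguards that this sketch omits.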
The training process outputs models in PyTorch format. To deploy the PyTorch model to production, the team chose Open Neural Network Exchange (ONNX) Runtime, a high-performance inference engine for models in the ONNX format. They used PyTorch’s built-in support for exporting models to the ONNX format and then accelerated the model with ONNX Runtime. Pairing ONNX Runtime with PyTorch simplifies the process of meeting production performance and scale requirements.
For more information about building federated learning pipelines on Azure Machine Learning, see Microsoft’s freely available Federated Learning cookbook, which walks through training federated learning models for your own scenarios.
So that’s how federated learning works, but what can Text Analytics for Health do?
Text Analytics for Health enables use cases and applications that help customers leverage information and insights extracted from unstructured clinical text to drive value and improve patient care. One such application is clinical trials matching. Healthcare organizations can match patient data against trial criteria to quickly identify candidates. Likewise, patients can match their data against potential trials using information their physician already recorded to help identify relevant clinical trials. Because clinical trials depend on recruitment of eligible patients, adopting this type of AI-based solution can help speed the delivery of new medical treatments.
Another classic use case is analyzing doctors’ notes, as illustrated in this 2020 AI Show demo recorded during the pilot phase. The solution extracts entities from the notes across multiple categories, including patient diagnoses, any condition qualifiers (such as phase of illness), symptoms, body parts, any measurement values, and time. Most importantly, it can identify which values are negated, like symptoms the patient reports that they no longer have.
The team considers this just the first step toward unlocking the power of natural language processing for the medical domain. Anyone wanting to create their own custom solution based on this approach can begin with the Text Analytics for Health service.
Text Analytics for Health is available to the public, recently adding multilingual support beyond English in preview so the solution can process text in English, Spanish, French, Italian, Portuguese, German, and Hebrew.