Lessons from deploying HealthBots with experts-in-the-loop



Transcript


In many parts of the world, access to healthcare isn’t limited by technology; it is limited by access to human expertise.

Mohit Jain from MSR India will share HealthBots, expert-in-the-loop systems that combine LLM-generated responses with fast expert review, deployed through familiar tools like WhatsApp. These systems are already helping patients and frontline healthcare workers get timely, trusted guidance at scale.

This is a great example of research advancing AI for all: using AI to extend human expertise, not replace it. Mohit, over to you.

Hey. Hi, everyone. I’m Mohit Jain, a principal researcher at Microsoft Research India. Today I’ll be talking about the lessons I learned while deploying health bots with experts in the loop. Let’s get started. Patients, typically those undergoing critical treatment, let’s say a surgery or cancer treatment, require timely, trustworthy, and precise medical information.

However, in a country like India, with only seven doctors for every 10,000 people, it is very hard for a medical professional to spend much time with every patient, which negatively impacts information sharing with these patients.

From a doctor’s perspective, questions are fairly similar across patients. For example, somebody who is undergoing cataract surgery or going through head and neck cancer treatment might ask: when can I start eating biryani after surgery? Or when can I drive my car again? With the advent of large language models, we somehow assume that they will help us make healthcare more accessible.

However, as we all know, they do make errors and they do hallucinate. Because of that, even organizations like OpenAI and Anthropic clearly state in their usage policies that for critical settings like healthcare, finance, or legal, we should not completely rely on large language models. So the main research question we are trying to answer in this work is: can experts, in this case medical professionals, help us improve current AI systems, and in return, can AI systems help these medical professionals attend to more patients?

To answer these research questions, we developed multiple chatbots. One is for patients undergoing cataract surgery at an eye hospital. Let’s see how it actually works in action. In this case, the chatbot runs entirely on WhatsApp, so there is no requirement to download a new app.

It is also the most accessible option in a country like India. The patient can ask a question in any language they are comfortable in, and they can even send a voice message. In this case, they could easily have asked the same question by sending a voice message in Hindi. Here, for the sake of the broader audience, let’s say the patient asks in English: when can I wash my hair after the cataract surgery?

We use a knowledge base curated by the doctors of that eye hospital to generate the answer, which states that you can wash your hair three days after the surgery. However, it comes with a question mark, because the answer has not yet been verified by a doctor. We also show a few related questions so that the patient can continue the conversation.
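The flow described so far could be sketched roughly like this. This is a minimal illustration, not the deployed system’s code: keyword overlap stands in for the real retrieval pipeline, and all names and sample knowledge-base entries are made up.

```python
# Sketch: answer a patient question from a doctor-curated knowledge base,
# marking the reply as unverified until a doctor reviews it.
# Keyword overlap stands in for the real retrieval; entries are illustrative.

KNOWLEDGE_BASE = [
    {"question": "when can i wash my hair after cataract surgery",
     "answer": "You can wash your hair three days after the surgery.",
     "related": ["Can I use shampoo?", "When can I take a head bath?"]},
    {"question": "when can i drive after cataract surgery",
     "answer": "Avoid driving for at least one week after the surgery.",
     "related": ["When can I ride a two-wheeler?"]},
]

def retrieve(query: str) -> dict:
    """Pick the knowledge-base entry sharing the most words with the query."""
    words = set(query.lower().replace("?", "").split())
    return max(KNOWLEDGE_BASE,
               key=lambda e: len(words & set(e["question"].split())))

def answer_patient(query: str) -> dict:
    entry = retrieve(query)
    return {
        "text": entry["answer"],
        "status": "unverified",        # shown to the patient as a question mark
        "related_questions": entry["related"],
    }

reply = answer_patient("When can I wash my hair after the cataract surgery?")
print(reply["status"], "-", reply["text"])
```

The key design point is that the patient is never shown an answer without an explicit verification status attached.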

At the same time as this answer goes to the patient, it is also sent to the doctor who is going to operate on that patient. Along with the question-answer pair, the doctor receives a simple question: was the bot’s answer correct and complete? The doctor has three options.

They can say yes, it is a correct answer. They can say no. Or they can delegate the question to a patient coordinator to answer it, because we do some classification in the backend: if it is a medical question, it goes to the doctor, and if it is a logistical question, it goes to a patient coordinator.
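That backend routing step could look something like the following sketch. The deployed system presumably uses an LLM classifier; a keyword rule keeps this example self-contained, and the cue words are purely illustrative.

```python
# Sketch of the backend routing step: classify each incoming question as
# medical or logistical and send it to the doctor or the patient coordinator.
# Keyword cues stand in for a real LLM classifier; all cues are illustrative.

LOGISTICAL_CUES = {"appointment", "timing", "fee", "parking", "address", "reschedule"}

def route(question: str) -> str:
    words = set(question.lower().replace("?", "").split())
    if words & LOGISTICAL_CUES:
        return "patient_coordinator"
    return "doctor"

print(route("Can I reschedule my appointment?"))   # patient_coordinator
print(route("When can I wash my hair?"))           # doctor
```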

Let’s say in this case the doctor says that the answer is incomplete. That triggers a longer workflow. We inform the patient upfront that the previously provided answer was invalid and ask them to please wait for the corrected answer. Along with that, we ask the doctor to reply with the correction, and here is the beauty of our solution.

The doctor is not required to manually edit the answer. They just have to provide feedback, which can be grammatically incorrect, incomplete, and completely informal. In this case, the doctor could have just written “2” followed by a space and “w”, and we would have understood that the doctor is saying the patient has to wait for two weeks, not three days.
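Turning that shorthand into a patient-facing sentence could be sketched like this. In the real system an LLM interprets the free-form correction; a small abbreviation table is used here just to illustrate the idea, and the phrasing template is an assumption.

```python
# Sketch: expand a doctor's informal shorthand correction (e.g. "2 w")
# into a formal, patient-facing answer. A tiny lookup table stands in
# for the LLM that interprets arbitrary free-form corrections.

UNITS = {"d": "days", "w": "weeks", "m": "months"}
NUMBERS = {"1": "one", "2": "two", "3": "three", "4": "four"}

def formalize(shorthand: str, topic: str) -> str:
    num, unit = shorthand.split()
    duration = f"{NUMBERS.get(num, num)} {UNITS.get(unit, unit)}"
    return f"Better to avoid {topic} for {duration} after the surgery."

print(formalize("2 w", "washing your hair"))
# Better to avoid washing your hair for two weeks after the surgery.
```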

However, that raw correction is not sent to the patient. The patient receives a more formal version of the answer, saying that it is better to avoid washing your hair for two weeks after the surgery. It also comes with a green tick mark and the statement: this is a verified answer from your doctor.

Now we have this really important data from an expert, a medical expert. We use that data to update the knowledge base so that the next time another patient asks a similar question, let’s say, “Surgery is done. Can I shampoo?”, which is pretty much the same question asked in a different way,

we inform the patient that this answer was previously provided to a similar question, and ask: does this answer your question, yes or no? This pre-verified answer reduces the expert’s workload over time.
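That reuse of verified answers could be sketched as a similarity lookup. Word-overlap (Jaccard) similarity stands in here for the embedding-based matching a real system would use, and the threshold is an arbitrary illustrative value.

```python
# Sketch: when a new question is similar enough to one a doctor has already
# corrected, send the pre-verified answer directly, skipping the doctor
# round-trip. Jaccard word overlap stands in for embedding similarity.

VERIFIED = {
    "when can i wash my hair after cataract surgery":
        "Better to avoid washing your hair for two weeks after the surgery.",
}

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def lookup_verified(question: str, threshold: float = 0.4):
    best = max(VERIFIED, key=lambda q: jaccard(q, question))
    if jaccard(best, question) >= threshold:
        return VERIFIED[best]   # already vetted by a doctor
    return None                 # fall back to the unverified RAG answer

print(lookup_verified("surgery is done can i wash my hair"))
```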

Now this system has evolved into a broader framework that we call Build Your Own Expert Bot, which is available as an open-source tool on the Microsoft GitHub page, so anybody can build this kind of expert-in-the-loop chatbot. We used the same platform to build three deployments. One, as I told you, is for patients undergoing cataract surgery at the Sankara Eye Hospital.

This has been running at three Sankara Eye hospitals, in Bangalore, Hyderabad, and Jaipur, supporting different languages and different socioeconomic demographics. More than 3,000 patients have used it in the last 9 to 12 months, more than 10,000 questions have been answered, and more than 20 doctors are actively verifying these answers.

For community health workers, we have deployed it with an organization called Khushi Baby. Right now it is actively working in five districts in the Maharashtra and Rajasthan states of India. More than 11,000 ASHA workers, who are the community health workers in India, have used it, and more than 50,000 questions have been answered in the community health worker scenario.

Here the experts are not doctors; they are ASHA supervisors, the auxiliary nurse midwives, and hence, instead of a single doctor, a pool of these ASHA supervisors verifies every answer. The third deployment is for patients undergoing head and neck cancer surgery at a medical hospital, and that system is called OncoBot.

Through all of these deployments we have learned multiple lessons, and I am putting the most important ones here. The first one is expert verification. Whenever we say that your doctor, let’s say Dr. Kaushik, has verified your answer, it adds a lot of trust to those answers.

So it’s very, very important for us to have this expert in the loop. We also need to make sure that the workload on the doctor is minimal, and hence accepting all that shorthand, grammatically incorrect correction is what makes the system work. Also, the system is completely asynchronous: the doctor doesn’t have to respond in real time; they can respond whenever they are taking a coffee break.

Second, compared to other kinds of systems where we really enjoy the creativity of generative AI, in our case reliability beats creativity: the constrained, static AI relying on the RAG pipeline actually outperforms a fully generative system, where we cannot control the kind of output it will generate, and that constraint also acts as an automatic guardrail for the system.

The third point is admitting uncertainty: whenever the bot is unaware or not sure of an answer, it just says, “I don’t know.” For example, a patient might ask, “Can I do pranayama after my surgery?” Pranayama is a kind of foundational yoga breathing practice, but the knowledge base doesn’t have that information, so the bot will simply respond that it does not know, and the answer will then be given by the doctor.
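That “admit uncertainty” guardrail amounts to a confidence threshold on retrieval. A rough sketch, with word-overlap scoring standing in for real retrieval confidence and an arbitrary illustrative threshold:

```python
# Sketch of the "admit uncertainty" guardrail: if no knowledge-base entry
# is similar enough to the question, reply "I do not know" and escalate
# to the doctor rather than risk a hallucinated answer.

KB = {
    "when can i wash my hair after cataract surgery":
        "You can wash your hair three days after the surgery.",
}

def overlap(a: str, b: str) -> float:
    wa = set(a.lower().replace("?", "").split())
    wb = set(b.lower().replace("?", "").split())
    return len(wa & wb) / len(wa | wb)

def answer(question: str, threshold: float = 0.5) -> str:
    best = max(KB, key=lambda q: overlap(q, question))
    if overlap(best, question) < threshold:
        return "I do not know."   # escalate to the doctor instead
    return KB[best]

print(answer("Can I do pranayama after my surgery?"))   # I do not know.
print(answer("When can I wash my hair after cataract surgery?"))
```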

The fourth is that localization is not translation. A typical Azure Language translation works really well for certain languages, but for local Indic languages like Hindi, Tamil, or Telugu it doesn’t work that well, because it doesn’t have the domain, cultural, or contextual knowledge.

So instead of that, we started relying on GPT, because we can always provide that context, for instance that this is a question asked by an ASHA worker. For example, when an ASHA worker asked about the side effects of Antara tika, which is a kind of contraceptive injection, the translation from the Azure Language technology was “Antarctica.”

However, when we actually give the information that this is a question asked by an ASHA worker, we get the appropriate translation. And finally, in the current healthcare ecosystem of India, there is a huge power difference between a doctor and a patient, and even between a supervisor and a community health worker.
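The difference comes down to what context the prompt carries. A sketch of building such a context-rich translation prompt; the template, wording, and examples are all illustrative, not the deployed prompt:

```python
# Sketch of "localization is not translation": the LLM prompt carries who
# is asking and in what domain, so a term like "Antara tika" is treated as
# a contraceptive injection rather than mistranslated as "Antarctica".
# The template is illustrative, not the deployed system's prompt.

def build_translation_prompt(text: str, src: str = "Hindi") -> str:
    return (
        "You are translating messages for a community health chatbot.\n"
        f"The following question was asked by an ASHA community health worker in {src}.\n"
        "Keep local medical terms (drug, vaccine, and injection names, e.g. 'Antara') intact.\n"
        f"Translate to English: {text}"
    )

prompt = build_translation_prompt("Antara tika ke side effects kya hain?")
print(prompt)
```

The prompt would then be sent to the LLM in place of a bare translation request.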

And they are always very hesitant to ask what they see as dumb, simple, or basic questions. However, because there is a chatbot sitting between the two end users, and the chatbot is non-judgmental, the patient or the community health worker feels empowered to ask very fundamental or basic questions without being judged.

That’s all from me. Thank you.