Utilizing consumer cameras for contact-free physiological measurement in telehealth and beyond


By , Principal Researcher , PhD student, University of Washington

Our research is enabling robust and scalable measurement of physiology. Cameras on everyday devices can be used to detect subtle changes in light reflected from the body caused by physiological processes. Machine learning algorithms are then used to process the camera images and recover the underlying pulse and respiration signals that can then be used for health and wellness tracking.

According to the CDC WONDER Online Database, heart disease is currently the leading cause of death for both men and women in the United States. However, most deaths due to cardiovascular diseases could be prevented with suitable interventions. Early detection of changes in health and well-being can have a significant impact on the success of these interventions and boost the chances of positive outcomes. Atrial fibrillation (AFib) is an example of a symptom that can indicate increased risk of heart disease, and when detected early, it can inform interventions that help to reduce risk of stroke.

Physiological sensing plays an important role in helping people track their health and detect the onset of symptoms. However, there are barriers to conducting physiological sensing that act as a disincentive, such as access to medical devices and the inconvenience of performing regular measurements. Making physiological sensing more accessible and less obtrusive can reduce the burden on people to perform physiological assessments of this kind and help catch early warning signs of symptoms like AFib.

Over the past decade, researchers have discovered that increasingly available webcams and cellphone cameras combined with AI algorithms can be used as effective health sensors. These methods involve measurement of very subtle changes in the appearance of the body across time, in many cases changes imperceptible to the unaided human eye, to recover physiological information. In essence, as ambient light in a room hits your body, some is absorbed and some is reflected. Physiological processes such as blood flow and breathing change the appearance of the body very subtly over time.

A smartphone camera can pick up this reflected light, and the changes in pixel intensities over time can be used to recover the underlying sources of these variations (namely a person’s pulse and respiration). Using optical models grounded in our knowledge of these physiological processes, a video of a person can be processed to determine their pulse rate, respiration, and even the concentration of oxygen in their blood.
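
As a rough illustration of the idea above, the sketch below averages pixel intensities in each frame and picks the dominant frequency in a plausible heart rate band. It is a toy under stated assumptions, not the method from our paper: the function names, the 42 to 240 beats-per-minute search range, and the synthetic 2x2 "video" are all hypothetical, and real footage would need face tracking, motion handling, and filtering that are omitted here.

```python
import math

def spatial_mean(frame):
    """Average all pixel values in one frame (a list of rows)."""
    total = sum(sum(row) for row in frame)
    count = sum(len(row) for row in frame)
    return total / count

def estimate_rate_bpm(trace, fps, low_bpm=42, high_bpm=240):
    """Return the candidate rate (beats per minute) with the most spectral power.

    The search band is an assumption chosen to cover typical pulse rates.
    """
    mean = sum(trace) / len(trace)
    centered = [v - mean for v in trace]  # remove the constant brightness level
    best_bpm, best_power = low_bpm, -1.0
    for bpm in range(low_bpm, high_bpm + 1):
        omega = 2.0 * math.pi * (bpm / 60.0) / fps  # radians per frame
        re = sum(v * math.cos(omega * n) for n, v in enumerate(centered))
        im = sum(v * math.sin(omega * n) for n, v in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm

# Synthetic example: 10 s of 2x2 "video" at 30 fps whose brightness
# oscillates very slightly at 1.2 Hz (72 beats per minute).
fps = 30
frames = [
    [[100.0 + 0.5 * math.sin(2.0 * math.pi * 1.2 * n / fps)] * 2] * 2
    for n in range(10 * fps)
]
trace = [spatial_mean(frame) for frame in frames]
print(estimate_rate_bpm(trace, fps))
```

The oscillation in this example is only 0.5% of the average brightness, which gives a sense of why the changes are invisible to the eye yet still recoverable from the pixel values.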

Building on previous work, our team of researchers from Microsoft Research, University of Washington, and OctoML has created a video-based, on-device approach to optical cardiopulmonary vital sign measurement. The approach uses everyday camera technology (such as webcams and mobile devices) and a novel convolutional attention network, called MTTS-CAN, to make real-time cardiopulmonary measurements possible on mobile platforms with state-of-the-art accuracy. Our paper, “Multi-Task Temporal Shift Attention Networks for On-Device Contactless Vitals Measurement,” has been accepted at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020) and will be presented in a Spotlight talk on Monday, December 7th, from 6:15 PM to 6:30 PM (PT).

Camera-based physiological sensing applications in telehealth

Camera-based physiological sensing has numerous fitness, well-being and clinical applications. For everyday consumers, it could make home monitoring and fitness tracking more convenient. Imagine if your treadmill or smart at-home fitness equipment could continuously track your vitals during your run without you needing to wear a device or sync the data. In clinical contexts, camera-based measurements could enable a cardiologist to more objectively analyze a patient’s heart health over a video call. Contact sensors, necessary for monitoring vitals in intensive care, can damage the skin of infants—remote sensing could provide a more comfortable solution.

Perhaps the most obvious application for camera-based physiological sensing is in telehealth. The SARS-CoV-2 (COVID-19) pandemic is transforming the face of healthcare around the world. One example of this revolution can be seen in the number of medical appointments held via teleconference, which has increased by more than an order of magnitude because of stay-at-home orders and greater burdens on healthcare systems. This shift is driven by the desire to protect healthcare workers and by restrictions on travel, but telehealth also benefits patients by saving them time and costs. The Centers for Disease Control and Prevention is recommending the “use of telehealth strategies when feasible to provide high-quality patient care and reduce the risk of COVID-19 transmission in healthcare settings.” COVID-19 has been linked to increased risk of myocarditis and other serious cardiac (heart) conditions, and experts are suggesting that particular attention should be given to cardiovascular and pulmonary protection during treatment.

In most telehealth scenarios, however, physicians lack access to objective measurements of a patient’s condition because of the inability to capture signals such as the patient’s vital signs. This concerns many patients because they worry about the quality of the diagnosis and care they can receive without objective measurements. Ubiquitous sensing could help transform how telehealth is conducted, and it could also contribute to establishing telehealth as a mainstream form of healthcare.

It can take many years for new technologies such as these to transition from research discoveries to mature applications. The fields of AI and computer vision, as a whole, are six decades old, yet it is only in the past 10 years that many applications have started to reach fruition. Research on camera-based vital sign monitoring began much more recently—within the past 15 years—so there is still a lot of effort required to help it reach maturity.

Improving accuracy, privacy, and latency for contactless vital sign sensing methods

Contact sensors (electrocardiograms, oximeters) are the current gold standard for measurement of heart and lung function, yet these devices are still not ubiquitously available, especially in low-resource settings. The development of video-based contactless sensing of vital signs presents an opportunity for highly scalable physiological monitoring. Computer vision for remote cardiopulmonary measurement is a growing field, and there is room for improvement in the existing methods.

First, the accuracy of measurements is critical to avoid false alarms or misdiagnoses. The US Food and Drug Administration (FDA) mandates that testing of a new device for cardiac monitoring should show “substantial equivalence” in accuracy with a legal predicate device (for example, a contact sensor). Non-contact approaches have not yet met this standard. Second, designing models that run on-device helps reduce the need for high-bandwidth internet connections, making telehealth more practical and accessible. Our method, detailed below, works to improve accuracy with a newly designed algorithm (see Figure 1) and runs on-device.

Figure 1: The trade-off between latency (the time it takes to process each frame of video) and error in heart rate estimation. An optimal method would be in the top left corner, meaning we can process video frames at a high rate and with small errors. Our proposed method, MTTS-CAN, has the lowest latency and has accuracy that is well above the baseline we used for our research. The MT-Hybrid-CAN was also developed as part of our research to support devices with greater computational power, such as PCs.

Camera-based cardiopulmonary measurement is also a highly privacy-sensitive application. The data is personally identifiable, combining videos of a patient’s face with sensitive physiological signals. Therefore, streaming and uploading data to the cloud to perform analysis is not ideal. This motivated our focus on developing methods that run on-device, helping keep people’s data under their control.

Finally, the ability to run at a high frame rate enables opportunistic sensing (for example, obtaining measurements each time you look at your phone). It also helps capture waveform dynamics that could be used to detect atrial fibrillation and hypertension and to measure heart rate variability, where high frame rates (at least 100 Hz) are required to yield precise measurements of the waveform dynamics.

MTTS-CAN: Using a convolutional neural network to improve non-contact physiological sensing

To help address the gaps in the current research, we developed an algorithm for multi-parameter physiological measurement that can run on a standard mid-range mobile phone, even at high frame rates. The method uses a type of deep learning algorithm called a convolutional neural network and analyzes pixels in a video over time to extract estimates of heart and respiration rates. The algorithm extracts two representations of the face: 1) the motion representation, which contains the temporal changes in pixel information, and 2) the appearance representation, which helps guide the network toward the spatial regions of the frame to focus on. Our specific design of this method is called a multi-task temporal shift convolutional attention network (MTTS-CAN). See Figure 2 below for details.
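
To make the two representations more concrete, here is a minimal sketch. It assumes the motion representation is a normalized difference between consecutive frames, a formulation common in related work, and that the appearance representation is simply the frame itself. This post does not spell out either definition, so treat both functions, their names, and the `eps` parameter as hypothetical.

```python
def motion_representation(frames, eps=1e-7):
    """Normalized difference between each pair of consecutive frames.

    Assumed formulation: (current - previous) / (current + previous).
    Dividing by the sum reduces the influence of overall brightness.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([
            [(c - p) / (c + p + eps) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return diffs

def appearance_representation(frames):
    """The raw frames (all but the last), aligned one-to-one with the differences."""
    return frames[:-1]

# Three tiny single-channel 2x2 frames with small brightness changes.
frames = [
    [[100.0, 100.0], [50.0, 50.0]],
    [[101.0, 100.0], [50.0, 49.0]],
    [[100.0, 100.0], [50.0, 50.0]],
]
motion = motion_representation(frames)
appearance = appearance_representation(frames)
print(len(motion), len(appearance))
print(motion[0])
```

In this sketch a one-unit change on a bright pixel (100 to 101) yields a smaller motion value than a one-unit change on a darker pixel (50 to 49), because each difference is scaled by the local intensity.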

Figure 2: MTTS-CAN is a new neural network architecture that allows for efficient, multi-parameter physiological measurement from video. The video is analyzed to extract subtle changes in pixel intensities over time and then recover estimates of the underlying pulsatile and respiratory signals.

We introduced several features to help address the challenges of privacy, portability, and precision in contactless physiological measurement. Our end-to-end MTTS-CAN performs efficient temporal modeling and removes sources of noise without any added computational overhead by leveraging temporal shift operations rather than 3D convolutions, which are computationally onerous.
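
The following sketch shows the general idea of a temporal shift on a small feature array: a fraction of the channels is moved one step along the time axis in each direction, so that a later per-frame (2D) operation sees information from neighboring frames without any extra multiplications. The `fold_div` parameter, the zero padding at the ends, and the layout `x[t][c]` are assumptions made for illustration, not details confirmed by this post.

```python
def temporal_shift(x, fold_div=3):
    """Shift some channels one step along time; x[t][c] is one feature value.

    The first `fold` channels take their value from the next frame, the
    second `fold` channels from the previous frame, and the rest are kept.
    """
    num_frames, num_channels = len(x), len(x[0])
    fold = num_channels // fold_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # borrow from the future frame
            elif c < 2 * fold:
                src = t - 1      # borrow from the past frame
            else:
                src = t          # unchanged channel
            if 0 <= src < num_frames:
                out[t][c] = x[src][c]  # positions with no source stay zero
    return out

# 3 frames, 4 channels; value 10*t + c makes the movement easy to read.
x = [[10.0 * t + c for c in range(4)] for t in range(3)]
shifted = temporal_shift(x, fold_div=4)
for row in shifted:
    print(row)
```

The shift only moves values between positions, which is why it adds temporal context without adding arithmetic.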

These shift operations allow the model to capture complex temporal dependencies, which are particularly important for recovering the subtle dynamics of the pulse and respiration signals. An attention module improves signal source separation by helping the model learn which regions of the video frame to apply greater importance to, and a multi-task mechanism shares the intermediate representations between pulse and respiration to jointly estimate both simultaneously.
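
A soft attention mask of this kind can be sketched as follows. The example assumes the mask comes from a sigmoid over per-location scores derived from the appearance representation and is rescaled by its L1 norm so that it always sums to half the number of locations. That normalization is borrowed from related attention-network work and is not stated in this post; the function names and toy values are hypothetical.

```python
import math

def attention_mask(scores):
    """Sigmoid of per-location scores, rescaled by the mask's L1 norm.

    Assumed normalization: area * sigmoid / (2 * L1), so the mask sums to
    area / 2 regardless of how large the raw scores are.
    """
    sig = [[1.0 / (1.0 + math.exp(-s)) for s in row] for row in scores]
    l1 = sum(sum(row) for row in sig)
    area = len(sig) * len(sig[0])
    return [[area * v / (2.0 * l1) for v in row] for row in sig]

def apply_attention(motion_features, mask):
    """Element-wise gating of motion features by the attention mask."""
    return [
        [m * a for m, a in zip(m_row, a_row)]
        for m_row, a_row in zip(motion_features, mask)
    ]

# Toy scores: the top-left location looks like skin, the rest like background.
scores = [[4.0, -4.0], [-4.0, -4.0]]
mask = attention_mask(scores)
motion_features = [[1.0, 1.0], [1.0, 1.0]]
gated = apply_attention(motion_features, mask)
print(mask)
print(gated)
```

Locations with high scores keep most of their motion signal, while background locations are strongly attenuated before the pulse and respiration outputs are computed.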

Multi-task learning is effective for two reasons. First, heart rhythms are correlated with breathing patterns, meaning the two signals share some common properties. This phenomenon is known as respiratory sinus arrhythmia (RSA). Second, by sharing many of the preliminary processing steps, we can dramatically reduce the computation required.

By combining these three techniques, our proposed network can run on a mobile CPU and achieve state-of-the-art accuracy and inference speed. Ultimately, these features result in significant improvements in recovering real physiological signals, such as heart rate and pulse (see Figure 3).

Figure 3: MTTS-CAN reduces the error in heart rate measurement and considerably improves the pulse signal-to-noise ratio compared to previous methods such as ICA, CHROM, POS, and 2D-CAN on a large benchmark dataset.

One concern with optical measurement of vital signs is whether the technology will perform equally well across people, including all skin types and appearances (for example, people with facial hair or those wearing cosmetics, head coverings, or glasses). We have worked on characterizing these differences and helping to reduce them using personalization and data augmentation. Improving sensing technology to create equitable performance is a central focus of this research.

We hope that this work advances the speed at which scalable non-contact sensing can be adopted. Atrial fibrillation (AFib) is just one of the most common cardiovascular symptoms that impact millions of people and could be better detected with more accurate, easily deployed non-contact health sensing systems. Our work is a step in this direction. Through our research we are continuing to develop methods for sensing other physiological parameters, such as blood oxygen saturation and pulse transit time.

If you’re interested in learning more about our research in physiological sensing, there are a number of resources available. Our project page is a hub for publications and related content, including links to open-source code. We also recently gave a webinar on contactless camera-based health sensing that further elaborates on this work and dives deeper into how the technology works. Register now to watch the on-demand webinar/Q&A.
