Microsoft Research Podcast

Microsoft Researcher Dr. Daniel McDuff. Photography by Maryatt Photography.

Episode 17, March 28, 2018

One of the most intriguing areas of machine learning research is affective computing, where scientists are working to bridge the gap between human emotions and computers. It is here, at the intersection of psychology and computer science, that we find Dr. Daniel McDuff, who has been designing systems, from hardware to algorithms, that can sense human behavior and respond to human emotions.

Today, Dr. McDuff talks about why we need computers to understand us, outlines the pros and cons of designing emotionally sentient agents, explains the technology behind CardioLens, a pair of augmented reality glasses that can take your heart rate by looking at your face, and addresses the challenges of maintaining trust and privacy when we’re surrounded by devices that want to know not just what we’re doing, but how we’re feeling.



Daniel McDuff: We’ve developed a system that allows people to look at another individual and see physiological responses of that person. So it’s data they wouldn’t normally be able to see, but it’s superimposed onto that other person so they can actually see their heart beating. They can see changes in stress, based on heart rate variability. And that’s all sensed remotely. But you’re giving the individual a new sensory channel that they can leverage…

(Music plays)

Host: You are listening to the Microsoft Research podcast. A show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

(Music plays)

One of the most intriguing areas of machine learning research is affective computing, where scientists are working to bridge the gap between human emotions and computers. It is here, at the intersection of psychology and computer science, that we find Dr. Daniel McDuff, who has been designing systems, from hardware to algorithms, that can sense human behavior and respond to human emotions.

Today, Dr. McDuff talks about why we need computers to understand us, outlines the pros and cons of designing emotionally sentient agents, explains the technology behind CardioLens, a pair of augmented reality glasses that can take your heart rate by looking at your face, and addresses the challenges of maintaining trust and privacy when we’re surrounded by devices that want to know not just what we’re doing, but how we’re feeling.

That and much more on this episode of the Microsoft Research podcast.

(Music plays)

Host: Daniel McDuff, welcome to the show today. Great to have you with us.

Daniel McDuff: It’s great to be here.

Host: So you’re in Human-Computer Interaction, or HCI, and you situate your research at the intersection of computer science and psychology. So, tell us in broad strokes about HCI and what you do.

Daniel McDuff: So the crux of what I do is teaching machines to understand people in a deeper way, and that involves capturing and responding to their emotional state. So, can we design a machine that really understands people, not just what they’re saying, but how they’re saying it and how they’re behaving? And I think that’s really fundamental to human-computer interaction because so much of what we do as people is nonverbal. It’s not described in language. And a lot of computer systems don’t understand that. That’s the focus of my work, is bringing that EQ to technology.

Host: EQ meaning Emotional Quotient?

Daniel McDuff: Yeah, that’s a sort of somewhat slang term and it’s used frequently to contrast IQ, which is something that technology has a lot of. Technology can answer lots of questions very quickly because it has access to all of the information on the internet, but not much technology has EQ.

Host: No. Uhhh… Does any?

Daniel McDuff: I think we’re starting to see the beginning of this. So you see social robots as a great example of systems which have some kind of personality they can express visually, some basic facial expressions on a screen or using sounds or lights. Movies are a great example. So, R2-D2 is a great example of a system that doesn’t have a face, but can still communicate emotions. Although that’s fictional, we are starting to see systems in the real world that kind of behave in a somewhat similar way.

Host: That’s fascinating. I even think of Wallace and Gromit animation where Gromit only communicates with his eyes and his eyebrows, and yet you get almost everything that he wants to say through his eyes.

Daniel McDuff: Exactly. And we take a lot of inspiration from animations and animators. Because I study facial expressions, it’s magical how some creators can show so much rich emotion just through a facial expression. And as we design systems that recognize those and exhibit them, there’s a lot we can learn from that side of the world.

Host: I’m intrigued by the field of affective computing. And I understand it aims to bridge the gap between human emotions and computational technology. So, what is affective computing, what does it promise, and what do we need computers to understand us as human beings for?

Daniel McDuff: At a high level, affective computing is designing systems that can read, interpret and respond to human emotion. And that sounds like a daunting task. There’s a lot more we need to do in research, but we’re starting to see real-world systems where this is true. So systems where they can read facial expressions for instance or understand the voice tone of someone, or look at sentiment for instance on Facebook posts or Twitter to understand the emotions that are being expressed. And this is kind of where the world is now, but in the future, we can imagine systems that use multimodal data, robotics systems that interact with us in an embodied way, that also sense this type of information. And that’s kind of the target we’re focused on.

Host: So, why do you think we need computers to understand us?

Daniel McDuff: I think it’s fundamental to how we interact as human beings. And so, when we interact with a system that doesn’t do those things that we take for granted, it can be off-putting. For instance, if a system doesn’t realize that I’m getting frustrated with it, it can be more frustrating. It can even be upsetting. There’s research showing that robots that can apologize are liked a lot more than robots that don’t, even if they’re no better at completing the task they were intending to complete. So, it can really improve our relationship and our well-being, because it fundamentally improves the interaction we have with the devices around us.

Host: So, you coauthored an article called, “Designing Emotionally Sentient Agents.” And aside from the Hollywood connotations that phrase brings to mind, what should we understand about this research, Daniel, and what should we be excited about or concerned about?

Daniel McDuff: I think there’s a lot to be excited about in homecare, in healthcare, in understanding just human interaction even more. If we can design systems to mimic some of those things, it will deepen our understanding of how we as humans behave. There are a number of challenges that we need to overcome. One is how do we sense this information. And sensors can be intrusive. You know, devices that are around us that are listening for commands all the time are starting to appear and you can imagine in future, there could be camera systems as well. So we need to think about the social norms that exist around the sensing side of things. Where does that data go? How is it stored? How do we know that the sensor’s on? How do we control it? How do we stop it from recording when we want to? And then, how do we allow other people who are in our lives to not be sensed even if we’re being sensed or if I invite someone into my home and I’ve got a device that’s always listening or always watching, what does that mean for our social interaction? So, I think there’s some challenges to overcome there, but there’s also more philosophical challenges about, how much do we teach computers of human emotion? Is it possible for a machine ever to feel emotion? What does that mean? And how should machines express emotion or respond to this information? We definitely don’t want to design systems that are manipulative or that make people feel like they’re more intelligent than they are. If someone sees a system that appears emotional, they might think, wow, this is really, really intelligent, even if it’s only expressing very basic behaviors. And that can be challenging because some of the other abilities of that system might be quite weak. And so people might trust it even if it can’t actually perform, accurately, the tasks it’s trying to do.

Host: You’re using terms that are interesting. And interesting is a kind of placeholder word for other words that I’m actually thinking. Like “sentient” and “understanding” regarding a machine. And I wonder how I should interpret that? What do people like you and your colleagues really believe about what you just addressed? Can a machine ever feel? Can it really understand? Can it become sentient?

Daniel McDuff: I think machines are fundamentally different to humans. Machines can recognize some expressions of emotion. They can respond to them. But I don’t think that that constitutes feeling an emotion. Feeling an emotion requires experience. It requires a reward and a cost associated with different actions. It’s much, much more complex than that. So I don’t think machines will ever experience emotion in the way that we do, but they will have many of the, sort of, fundamental skills that we have.

(Music plays)

Host: What can you tell us about the emerging field of what I would call artificial emotional intelligence or emotional technology? You use an example of a bathroom mirror that has ambient intelligence and can tell whether I’ve slept well. Why do I need that?

Daniel McDuff: That’s a good question. I think it’s important that we design systems that are ultimately beneficial to people. And one of the roadblocks, especially in healthcare, is that there’s so much rich data out there, but it’s very hard to understand it, or it’s cumbersome to monitor it. And so designing systems that make it seamless to be able to collect and understand that type of data is really important. So, at MIT, when I was a graduate student, we built a mirror that had a camera embedded in it. It was actually hidden behind 2-way glass, so all it looked like was just a regular mirror. But when you looked in the mirror, the camera was using some remote sensing technology we built to measure the heart rate of the person. And we can also measure things like heart rate variability, which is correlated with stress. And so the mirror could then display that information back to the user. So it’s not just reflecting their outward appearance, but their sort of inner physiological state as well. And I found that really compelling because, in many cases, we want to know that information, but we might not want to strap on a sensor or have to go out of our way to collect it and if it can be digitally captured by the devices we already use, there’s something quite compelling about that.
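
As a rough illustration of the two quantities mentioned here, heart rate and heart rate variability, the sketch below computes both from a list of heartbeat timestamps. It assumes the beat times have already been recovered by some sensor, and it uses RMSSD (the root mean square of successive differences between inter-beat intervals), a common time-domain variability measure. The choice of RMSSD and the function names are assumptions for illustration; the transcript does not say which measure the mirror used.

```python
import math


def inter_beat_intervals_ms(beat_times_s):
    """Convert heartbeat timestamps (seconds) into inter-beat intervals (ms)."""
    return [(b - a) * 1000.0 for a, b in zip(beat_times_s, beat_times_s[1:])]


def mean_heart_rate_bpm(beat_times_s):
    """Average heart rate in beats per minute over the whole recording."""
    ibis = inter_beat_intervals_ms(beat_times_s)
    return 60000.0 / (sum(ibis) / len(ibis))


def rmssd_ms(beat_times_s):
    """Root mean square of successive inter-beat interval differences (ms)."""
    ibis = inter_beat_intervals_ms(beat_times_s)
    diffs = [b - a for a, b in zip(ibis, ibis[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical beat times: intervals alternate between 800 ms and 850 ms.
beats = [0.00, 0.80, 1.65, 2.45, 3.30]
heart_rate = mean_heart_rate_bpm(beats)
variability = rmssd_ms(beats)
```

A real recording would need many more beats than this toy list, and how a variability value maps to stress depends on the person and the situation.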

Host: Let’s talk about the technical aspects of your work for a bit. Much of it’s centered on computer vision technologies and involves webcams and algorithms that aim to understand emotional states. What’s the field of computer vision founded on technically and what new developments are we seeing?

Daniel McDuff: So computer vision is exploding. The past 10 years have been some of the most exciting in this domain, with the invention of what’s commonly called deep learning. And so this is the ability to leverage huge amounts of data to train systems that are much more accurate than previous systems were. So, for instance, we have object recognition, text recognition, scene understanding that’s way more accurate than it used to be, because we have these systems that capture lots of complexities of the data. And because there’s so much data they can learn from, they get a really good representation. And understanding facial expressions has also benefited from the advances in this technology, as has a lot of other areas of affective computing, whether it’s speech recognition or understanding vocal prosody and things like that. So, there’s a lot of advances that have happened that basically improve the underlying sensing. And I don’t think up until this point we’ve really had the volume of data about emotions to go to the next level where we can really understand, okay, how do we build a system that actually knows what to do with these sensor inputs, with something that’s as amorphous as emotion is and hard to define?

Host: So, the basis of what you’re doing is on deep neural networks and machine learning models that you’re then applying to the affective domain.

Daniel McDuff: Exactly, yes. So we use deep learning for almost all of the sensing modalities we use, whether it’s vision-based or audio-based or language-based. And then that feeds into a system which is taking sort of intermediate-level information. For instance, does my facial expression appear positive or negative? Is my voice tone high-energy or low-energy? Is the language I’m using hostile or serene? And then, those intermediate states feed into a high-level understanding which is combined with context. So we need to know what’s happening to interpret emotion. We can’t just observe the person. We need to know the situation, the social context. And so that’s kind of where we’re moving, is really to combine these sensor observations with more contextual information.
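
The layering described here, per-modality sensing, then intermediate-level signals, then a context-aware interpretation, can be sketched in a few lines. Everything in this example is hypothetical: the field names, the context labels, the numeric shifts and the four output labels are invented to show the structure, and hand-written rules stand in for the learned models a real system would use.

```python
from dataclasses import dataclass


@dataclass
class IntermediateSignals:
    face_valence: float        # -1.0 (negative) to +1.0 (positive)
    voice_energy: float        # 0.0 (low energy) to 1.0 (high energy)
    language_hostility: float  # 0.0 (serene) to 1.0 (hostile)


# Hypothetical priors: how much a situation shifts the valence reading.
CONTEXT_VALENCE_SHIFT = {
    "playing_game": 0.5,
    "support_call": -0.2,
}


def interpret(signals, context):
    """Combine intermediate signals with context into a coarse label."""
    shift = CONTEXT_VALENCE_SHIFT.get(context, 0.0)
    valence = signals.face_valence - signals.language_hostility + shift
    high_energy = signals.voice_energy >= 0.5
    if valence >= 0:
        return "excited" if high_energy else "calm"
    return "frustrated" if high_energy else "sad"


# The same observed behavior reads differently in different situations.
observed = IntermediateSignals(face_valence=-0.1, voice_energy=0.9,
                               language_hostility=0.2)
in_game = interpret(observed, "playing_game")
on_call = interpret(observed, "support_call")
```

The same loud voice and mildly negative expression come out as "excited" in one situation and "frustrated" in the other, which is why observing the person alone is not enough.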

Host: So, I wasn’t going to ask you this. It wasn’t on my list. But how do you gather data on emotions? Do you have to bring people in and make them angry? I mean, it’s a serious question in a funny way.

Daniel McDuff: In the past, that was how it was often done. But a lot of my work in the last few years has been focusing on in-situ, large-scale data collection. So we always ask people if they want to opt in. And if they do, then we enable them to use a system which is part of their everyday life. So this might be a system that runs on their computer or runs on their cell phone and collects this data over time. Often, we might prompt them throughout the day, sort of how are you feeling? Or we might say is this feeling that we think you’re feeling correct? In order to get some kind of ground truth. But ultimately, we want to be able to collect real-life data about people’s emotional experiences. Because we know if they come to a lab, it’s not exactly the same as how it would be in the real world.

Host: One of the applications of this emotional technology is the workplace. In an MIT Sloan Management Review article, it claims that emotion-sensing technologies could help employees make better decisions, improve concentration, alleviate stress. So, tell us how this works and give us some examples of what it looks like. And then maybe tell me why I would want my boss to monitor my eye movements, my facial expressions, and my skin conductance?

Daniel McDuff: So, one example we give in that article is about a trader in Japan who unfortunately swapped the number of shares they were selling and the price of the shares, and got those two numbers the wrong way around. That ended up being a huge financial loss. And in high-stress situations that can be really problematic. Another example would be air traffic control. A very high-stress job where people have to be performing at a high level for the whole of the duration of their shift. And so if we can design technology that is able to sense when people are becoming overloaded, too stressed to perform at the level that they need to, we could give them that feedback. So, for individuals, that could be very helpful for knowing when they need to take a break. I, myself, in a job, you know, on an average day, it would be great if my computer knew when I was in flow and stopped interrupting me with email notifications. Or if I needed to take a break, it could suggest things that would help me relax and make me more productive when I came back to my desk. And then I think it would also benefit teams and organizations, knowing the well-being of your company is a really important thing. And we’re starting to see the development of really a science around organizations, and particularly focused on the social components. Social capital is really important, and emotion plays a big role in that.

Host: Tell me what safeguards a designer or a developer might think about so that this technology doesn’t become “nanny cam” in the workplace?

Daniel McDuff: That’s a really, really important question. And I think as we design this technology, it’s important that we design social norms around how it’s used. Ultimately, technology will advance. That’s somewhat inevitable. But how we use technology and the social norms that we design around it are not inevitable. So to give an example, one of the practices we follow is always “opt in.” So we always make sure that people choose to switch on sensors rather than having it imposed upon them. Another example is as we mentioned before, allowing people to turn off sensors. And it’s really important that people have that. It increases their trust and comfort with the system a lot. Those are a couple of examples about kind of social norms we can design around this technology. And I think there are many more that will develop as we kind of advance technology and think about use cases.

(Music plays)

Host: Let’s talk about reality for a bit. There’s actual reality, which I have a passing familiarity with, but also virtual reality, augmented reality, mixed reality. There’s so many realities. Give us a baseline definition of each of those different realities, uh, so we have a frame of reference for what I want to talk about next.

Daniel McDuff: Great, so virtual reality is a completely alternative environment. So this is where, most people will probably be familiar with virtual reality in terms of the headsets with a screen where all the information that you see is displayed on that screen. Then augmented reality is usually when you can see the real world, but there’s some augmentation of what you see. So there might be a transparent screen which is actually displaying certain objects which are superimposed on the real world. And then there’s this idea of mixed reality, which is really blurring the boundaries between virtual and augmented reality. So you’re leveraging much deeper understanding about the environment, as well as incorporating a lot more augmentation.

Host: So let’s go along that thread for a second here. Because when you talk about augmenting human perception through mixed or virtual reality, you suggest that VR might be able to help people develop “superhuman” senses. What are the possibilities and challenges even, of advancing human senses in this way?

Daniel McDuff: Yeah, so I mean, one of the things I find most fascinating about other areas of science, like neuroscience, is how adaptable we are, and particularly the brain is, at being able to learn new things based on sensory input. So, we have a panel at South by Southwest where we’re discussing some of the ways that sensor inputs can influence people’s perception. And one example that we’ve developed is a system that allows people to look at another individual and see physiological responses of that person. So it’s data they wouldn’t normally be able to see, but it’s superimposed onto that other person so that they can actually see their heart beating. They can see changes in stress based on heart rate variability. And that’s all sensed remotely. But you’re giving the individual a new sensory channel that they can leverage, something that they wouldn’t normally have.

Host: So this is like x-ray vision.

Daniel McDuff: In a sense, yes. That’s a good analogy.

Host: I mean, from the superhero realm, that’s…

Daniel McDuff: Exactly, yeah.

Host: So the idea of superhuman senses would be physiological senses that you wouldn’t normally be able to see, aside from somebody sweating or blushing, or you know their facial expressions. It’s inside their bodies.

Daniel McDuff: Exactly, yeah. Yeah, it’s hidden information that wouldn’t normally be accessible, but using the new technology, like high-definition cameras and this augmented experience that we can create through the HoloLens headset, we can allow you to see that information in real time.

Host: So maybe now junior high kids can actually find out if someone’s in love with them, just by putting on these glasses? And they don’t have to ask their friend to go ask their other friend if he likes me.

Daniel McDuff: I’ve always wanted to build that demo and just see how badly it fails.

Host: That would actually be a really compelling application of the technology, just to help the junior high kids. Um, so you’re one of the creators of an application of this technology that you call CardioLens, and while it’s still in the early stages of research, and it’s not being used in any real-life situations right now, you’re actually able to read my heart rate by looking at my face through a pair of augmented reality glasses. Tell me more about this. What are the possibilities of this research, down the road?

Daniel McDuff: Yeah, so I’ve been working on this area of remote, or non-contact, physiological measurement for a while. And this is the idea that a regular webcam, just the camera that might be on your cell phone or on your laptop, has the sensitivity to pick up very small changes in the color of your skin, or light reflected in your skin, to be more accurate, which are related to blood flow. So actually, by analyzing a video input, video stream to that camera, we can pick up your pulse, we can pick up your respiration rate and your heart rate variability. And there’s new work showing you can measure blood oxygenation and other things. And people are trying to get towards things like blood pressure. So just using a regular device, no adaptation to the hardware, and some software, we can recover this information. So, what we did was to put the algorithm on the HoloLens, which has a camera that faces forward. And so when you look at someone, it detects their face. It segments the skin. It analyzes the color change and recovers the physiological information, and then displays that back in real time, superimposed onto their appearance.
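
The last step of that pipeline, recovering a pulse rate from skin color changes, can be illustrated with a minimal sketch. It assumes the face has already been detected and the skin segmented, so the input is just the mean green value of the skin pixels in each frame. It then looks for the strongest periodic component in a plausible heart rate band using a plain frequency scan. This is a simple classical stand-in, not the deep learning method mentioned later in the conversation, and the function name, band limits and step size are assumptions.

```python
import math


def estimate_heart_rate_bpm(green_means, fps, lo_hz=0.7, hi_hz=4.0, step_hz=0.01):
    """Estimate pulse rate from per-frame mean green values of a skin region.

    Scans candidate frequencies between lo_hz and hi_hz and returns the one
    with the most spectral power, converted to beats per minute.
    """
    n = len(green_means)
    mean = sum(green_means) / n
    centered = [g - mean for g in green_means]  # drop the constant skin tone

    best_hz, best_power = lo_hz, -1.0
    steps = int(round((hi_hz - lo_hz) / step_hz))
    for k in range(steps + 1):
        hz = lo_hz + k * step_hz
        w = 2.0 * math.pi * hz / fps  # radians advanced per frame
        re = sum(c * math.cos(w * i) for i, c in enumerate(centered))
        im = sum(c * math.sin(w * i) for i, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_hz, best_power = hz, power
    return best_hz * 60.0


# Synthetic input: 10 seconds at 30 frames per second, with a faint
# 1.2 Hz (72 beats per minute) ripple on top of a constant skin tone.
fps = 30.0
frames = [120.0 + 0.5 * math.sin(2.0 * math.pi * 1.2 * i / fps)
          for i in range(300)]
bpm = estimate_heart_rate_bpm(frames, fps)
```

On this clean synthetic signal the estimate lands near 72 beats per minute. Real video adds motion, lighting changes and sensor noise, which a scan this simple does not handle.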

Host: How accurate is it?

Daniel McDuff: So, the technology can be very accurate. So we’ve done a lot of validation of this. We can measure heart rate to within 1 or 2 beats per minute, typically on a regular video. We’re using deep learning to address this problem, and we’ve got really, really good results on some of the hardest data sets that we’ve tried it on. When you take this out into the real world where people are moving around and the lighting’s changing and you can’t control if they’re making facial expressions or speaking, it starts to become more challenging, and that’s the type of data that we’re pushing towards addressing. So we want to start designing methods that are robust to all of those variations that we’d actually see in a real-life application.

Host: But normally, you would measure someone’s blood pressure or their pulse in a clinical setting. I mean, you wouldn’t necessarily – it would be a tool maybe for the medical community first, or…

Daniel McDuff: Yeah.

Host: Or the junior high boy that needs to know, do you love me?

Daniel McDuff: That might be the biggest market. No, I think… one application I am particularly excited about is in medical applications, for instance, surgery. So, you could see if a particular part of the body has good or poor blood flow. And that could be important in transplant operations where you’re attaching a new part of the body, a new organ, and you need to know if blood is flowing to that particular part of the body. And with a heads-up display, a surgeon could potentially look at that region and see if there is a blood flow signal. But there are other applications, too. Another example would be, for instance, being able to scan a scene and identify if there’s someone who’s alive, for instance, in a search and rescue application.

Host: Oh, interesting.

Daniel McDuff: And this also works with infrared cameras. So even if it’s dark, we can still measure the signal. There are other things like baby monitors or in hospital ICU units monitoring physiological information without having to have people wired up to lots of different sensors. We can just use a camera to do that.

Host: Every single show, I end up shaking my head. No one can see it happening, but it’s like, really? This is happening? I can’t believe it. It’s amazing. Talk about the trade-offs between the promises these technologies make and some of the concerns, very real concerns, about privacy of the data.

Daniel McDuff: Yeah, as I mentioned before, I think it’s really important that we design this technology appropriately. And I think that’s where we’ll see the biggest benefits. The benefits are when people recognize this is something that actually helps me in my everyday life or helps in a specific application like healthcare. There are definitely big challenges to privacy, because a lot of what we need to do to deploy this technology, is to be able to sense information longitudinally, on a large scale, because everyone experiences emotion differently. You can’t just take 10 people and train a system on 10 people that will generalize to the whole population. And so we do need to overcome that challenge of, you know how do we make this technology such that people feel comfortable with it, they trust it, and not leave them to feel as though their privacy is violated or that it’s too obtrusive. And so I think in design challenges, it’s about designing ways for people to be aware that technology is on, that it’s there, what it’s measuring, what it’s doing with that data. And these are some more unsolved problems as of yet.

Host: You said you prefer social norms over governmental regulations or legal remedies. So what’s the balance between the responsibilities of scientists, engineers, and programmers here, versus big regulatory initiatives like GDPR in Europe, and other things that might be coming down the pike?

Daniel McDuff: I think both are important. But the reason I prefer focusing on social norms is because as a designer, as an engineer, that’s something I can actively influence every day in my job. So I can think about, okay, so I’m going to design this sensor system that people are going to choose to use and capture their emotions and it’s going to create an experience that adapts to how they’re feeling. I can choose how to design that, and I can influence the social norms around that technology. So being kind of a leader in the research space allows me to do that actively, regularly. And I don’t think we can necessarily rely 100 percent on government or regulation to solve that piece of the puzzle. A good part of being part of MSR is that we’re very involved with the academic community. I’m involved with the ACM Future of Computing Academy. And our task group within that organization is to think about the ethical questions around AI. Not just in affective computing technology, but just broadly with machine learning and AI technology that can make decisions about important things like, for instance, healthcare or justice. And I think social norms and governmental regulation both serve a purpose there. But one of the things I personally can actively work towards on a daily basis is thinking through, what do I ask people to give up in terms of data and what do they get back for that, and how is that data used? And that’s something I’m really, really interested in.

(Music plays)

Host: Let’s talk about you for a second. I’m curious how you got interested in the emotional side of technology and how you ended up at MSR. Who were your influences, your inspirations, your mentors?

Daniel McDuff: So I did my Master’s at Cambridge University and was focused on machine learning. But I was very interested in how I could address more social problems with that technology, not just focus on predicting stock market prices or some of the sort of numerical analyses that are often solved using machine-learning algorithms. I wanted to see how this technology could actually help people. And at the time, my advisor for my PhD, Rosalind Picard, who is one of the founders of this field, was working a lot with applications for people on the autism spectrum for whom understanding emotions is a complex task and often a big challenge in social situations. And that was one of the reasons that I joined that lab, is because I really believed in the potential benefits of affective computing technology, not just to one portion of the population, but to everyone. I could see how it could benefit my life as well. So that’s how I got into it. And you know it’s becoming more true now, but certainly 10 years ago, there was no technology you could really think of that responded or understood human emotion.

Host: No. Even now.

Daniel McDuff: Even now, I mean, yeah. We’re getting there in research, but there’s not many real-life applications you could point towards and say, oh, this is an example of a system that really understands nonverbal or emotional cues.

Host: Right. So what was your path from Cambridge and Rosalind Picard to here?

Daniel McDuff: So I went to the MIT Media Lab where I did my PhD. And there I worked a lot on large-scale data analysis to do with understanding emotions in real-world contexts. And then I worked for a couple years at a startup and joined MSR out of that, and now lead the Affective Computing Technology development within Research here.

Host: That’s really cool. So, as we wrap up, Daniel, what thoughts or advice would you leave with our listeners, many of whom are aspiring researchers, who might have an interest in human-computer interaction or affective computing? What lines of research are interesting right now? What might augment, to use an industry term, the field?

Daniel McDuff: If I were to sort of summarize the areas that I think are most important, the first would be multi-modal understanding. So, in the past, a lot of the systems that have been built have focused just on one piece of information like for instance, facial expressions, or voice tone, or text. But to really understand emotions, you have to integrate all that information together. Because if I just look at facial expressions, you know if I were to show you a video of someone without the audio and without the information about what they were saying, it would be hard to interpret exactly how they felt. Or many people have probably experienced being on a phone call where they haven’t been able to exactly understand how someone was feeling because you’ve only got the voice tone and language to rely on. You don’t have all of that visual information about their gestures and facial expressions and body posture. So I think multi-modal understanding is really important. Another area that I’m particularly interested in is something we’ve touched on already, which is kind of deploying this in the real world. So, how do we take these experiments that have typically been performed in labs, in research environments, where you bring 10 or 20 people in and you get them to experience the system and you evaluate it. Which is fine for controlled studies, but ultimately, if we’re going to evaluate the real system and how people will actually respond to it in their everyday lives, we need to deploy it. And so that’s something we’re focused on, is really designing things that are so seamless that people can use them without them being a burden, and we can start to mine this data that occurs in everyday contexts.

Host: Daniel McDuff, it’s been fascinating talking to you. I wish there was more time, but thanks for coming in.

Daniel McDuff: Thank you very much. It’s a pleasure to be here.

(Music plays)

To learn more about Dr. Daniel McDuff’s work and find out how machine learning can help you improve your relationship with your computer, visit
