Microsoft Research Podcast

Speech and language: the crown jewel of AI with Dr. Xuedong Huang


Dr. Xuedong Huang

Episode 76, May 15, 2019

When was the last time you had a meaningful conversation with your computer… and felt like it truly understood you? Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft’s Speech and Language group, is successful, you will. And if his track record holds true, it’ll be sooner than you think!

On today’s podcast, Dr. Huang talks about his role as Microsoft’s Chief Speech Scientist, gives us some inside details on the latest milestones in speech and language technology, and explains how mastering speech recognition, translation and conversation will move machines further along the path from “perceptive AI” to “cognitive AI” and that much closer to truly human intelligence.



Xuedong Huang: At some point, let’s say computers can understand three hundred languages, can fluently communicate and converse. I have not run into a person who can speak three hundred languages. And not only machines can fluently communicate and converse, but can comprehend, understand and learn and reason and can really finish all the PhD courses in all subjects. The knowledge acquisition, reasoning, is beyond anyone’s individual capability. When that moment is here, you can think about how intelligent that AI is going to be.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: When was the last time you had a meaningful conversation with your computer… and felt like it truly understood you? Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft’s Speech and Language group, is successful, you will. And if his track record holds true, it’ll be sooner than you think!

On today’s podcast, Dr. Huang talks about his role as Microsoft’s Chief Speech Scientist, gives us some inside details on the latest milestones in speech and language technology, and explains how mastering speech recognition, translation and conversation will move machines further along the path from “perceptive AI” to “cognitive AI” and that much closer to truly human intelligence. That and much more on this episode of the Microsoft Research Podcast.

Host: Xuedong Huang, welcome to the podcast.

Xuedong Huang: Thank you.

Host: You are a Microsoft Technical Fellow in the speech and language group, and you lead Microsoft’s spoken language efforts. So, we’re going to talk in depth about these in a bit, but first, as the company’s Chief Speech Scientist, give us a general view of what you do for a living and why you do it. What gets you up in the morning?

Xuedong Huang: Well, what we do is really make sure we have the best speech and language technology that can be used to empower a wide range of scenarios. The reason we have a group to do that is really I feel that, you know, this is not only the most natural way for people to communicate, as we’re doing right now, but it’s really the hardest AI challenge we’re facing. So, that’s what we do, trying to really drive breakthroughs, deliver these awesome services on our cloud, Azure Services, and make sure we are satisfying a wide range of customers both inside Microsoft and outside of Microsoft. There are three things, really, if you want to frame this whole thing.

Host: Yeah.

Xuedong Huang: The first, we have the horsepower to really drive speech recognition accuracy. To drive the naturalness of our synthesis effort. To make sure translation quality is accurate when you translate from English to Chinese or French or German. So, there’s really a lot of science behind that, making sure the accuracy, naturalness, latency, they are really world-class. So that’s one. The second one is really, we not only provide technology, we deliver services on Azure. From Office to Windows to Cortana, they all depend on the same cloud services. And we also have edge devices, like our Speech Devices SDK. So, we want to make sure the speech on the edge and the cloud, they are really delivered in a modern fashion.

Host: Mm-hmm.

Xuedong Huang: That’s the platform in the cloud and embedded. So, that’s the second: the platform is modern. The third one is really, to show our love to the customer, because we have a wide range of customers worldwide. We want to really delight them and make sure our customers’ experience using speech translation is top notch.

Host: Yeah.

Xuedong Huang: That’s actually the three key things I do: AI horsepower, modernize our platform in the cloud and on the edge, and love our customers.

Host: Well, and you’ve got a lot of teams working in these groups to tackle each of these “pillars” we might call them.

Xuedong Huang: Yes. We have teams worldwide as well.

Host: Yeah.

Xuedong Huang: And so, the diversity is amazing because we are really trying to address the language barriers.

Host: Yeah.

Xuedong Huang: Trying to remove the language barriers. So, we do have teams in China. We have teams in Germany, in Israel, in India and in the US, of course. So, we really work around the globe trying to deal with these language challenges.

Host: So, I want to start by quoting you to set the stage for our conversation today. You said, “Speech and language is the crown jewel of AI.” So, unpack that for us.

Xuedong Huang: Mm-hmm. Well, we can think in the scale of human’s evolution. At some point, the language was born. That accelerated human’s evolution. If you think about all the animals on this planet, you know, there are animals running faster than humans, they can see better…

Host: Their teeth are sharper.

Xuedong Huang: …especially in the night.

Host: They’re stronger.

Xuedong Huang: Yep. They can actually hear better, smell better… Only we, humans, have the language. We can organize better. We can describe in science-fiction terms. We can really organize ourselves, create a constitution. So, if you look at the humans, it is speech and language that set us apart from other animals. For artificial intelligence, speech and language would drive the evolution of AI, just like it did to humans. That’s why it’s the crown jewel of AI.

Host: All right.

Xuedong Huang: And it’s a tough one to crack.

Host: Yeah. There’s a whole philosophical discussion on that topic alone, but it leads to some interesting questions about, you know, if you are wildly successful with machine language, what are these machines?

Xuedong Huang: So, let’s just actually, you know, set our imagination…

Host: Yeah, let’s do.

Xuedong Huang: …off a little bit, right? At some point, let’s say computers can understand three hundred languages, can fluently communicate and converse. I have not run into a person who can speak three hundred languages. And not only machines can fluently communicate and converse, but can comprehend, understand and learn and reason and can really finish all the PhD courses in all subjects. The knowledge acquisition, reasoning, is beyond anyone’s individual capability. When that moment is here, you can think about how intelligent that AI is going to be.

Host: Is this something you envision?

Xuedong Huang: Yes.

Host: Do we want that?

Xuedong Huang: Yes. I think this world will be a much better place. I was in Japan just a few weeks ago, carrying Microsoft Translator on my mobile devices. I was able to really communicate with Japanese people who do not speak Chinese or English. It’s already there. Microsoft Translator can speak the language I do not speak and helped me to be more productive when I was in Japan.

Host: So, I’m all about that. It just scares me a little bit to think about a machine… “we weren’t first, we’re not last, we’re just next…”

Xuedong Huang: But you know, there are two levels of intelligence. The first level is really perceptive intelligence. That is the ability to see, to hear, to smell. Then the high level is cognitive intelligence. That is the ability to reason, to learn and to acquire knowledge. Most of the AI breakthroughs we have today, they are in the perceptive level such as speech recognition, speech synthesis, computer vision. But this high-level reasoning and knowledge acquisition, cognitive capability, is still far from being close to human’s level.

Host: Right.

Xuedong Huang: And what I’m excited about translation, it is really something between perceptive intelligence and cognitive intelligence. And the fact that we are actually able to really build the success on the perceptive intelligence and expand into cognitive intelligence is quite a journey.

Host: Right.

Xuedong Huang: And uh, I do not know when we are going to reach that milestone. But that one is coming. It’s just a matter of time. It could take fifty years, but I think it is going to happen.

Host: We’ll have to come back for another podcast to talk about that milestone because we’re going to talk about a couple of milestones in a minute. But first I want to do a little bit of backtracking, because you’ve been around for a while and you started in Microsoft Research right about the time Rick Rashid was setting the organization up and speech was one of the first groups that was formed. And according to MSR lore, the goal of the group was to “make speech mainstream.” So, give us a brief history of speech at MSR. How has the research gone from “not mainstream” in those early “take risks and look far out days” to being a presence in nearly every Microsoft product today?

Xuedong Huang: Before I joined Microsoft Research, I was also on the faculty at CMU in Pittsburgh. So, Rick Rashid was a professor there. I was a junior faculty member. So, I was doing my research, mostly at CMU, on speech. Microsoft reached out and they wanted to set up a speech group. So, I moved, actually, on the first day of 1993, after New Year’s break. I flew from Pittsburgh to Seattle and started that journey and never changed. So, that was the beginning of Microsoft Speech. We were the research group that really started working on bringing speech to the developers.

Host: Right.

Xuedong Huang: So…

Host: Not just blue-sky research anymore…

Xuedong Huang: Not blue-sky research. So, we licensed technology from CMU. That’s how we started. So, we’re very grateful to CMU’s pioneering research in this area. So, we were the research group, but we delivered the first speech API, SAPI, on Windows 95. As a research group, we were pretty proud of that because usually research is doing only blue-sky research. We not only did blue-sky research, we continued to push the envelope, continued to improve the recognition accuracy, but we also worked with Windows, brought that technology to Windows developers. So, SAPI was the first speech API in the industry on Windows.

Host: Wow.

Xuedong Huang: And that was really quite a journey. And then, I eventually left research, joined the product group. I took the team! And it was also an exceptional Microsoft speech research group that came with me. Went to the product group. So, this has been really a fascinating twenty-seven years’ experience at Microsoft. I stopped doing speech after 2004, after we shipped the speech server, and I started many different things including running the incubation for research as a startup.

Host: Yeah.

Xuedong Huang: And I also worked as an architect for Satya Nadella when he was running Bing.

Host: Okay.

Xuedong Huang: And then, when Harry was running the Research and Technology group, I was helping incubate a wide range of AI projects from foundational pieces like a GPU cluster, Project Philly, the deep learning tool kit, CNTK. And of course, speech research, all the way to the high-end solutions like customer care intelligence.

Host: Yeah.

Xuedong Huang: And about three years ago, I had the privilege to return to run a combined Speech and Language group. So basically, we were able to consolidate all the resources working on speech and translation and that was the story, really, you know, the journey of my experience. A fascinating twenty-seven years.

Host: Where does Speech and Language live right now?

Xuedong Huang: So, as I said, we moved back and forth multiple times between research and product groups. Right now, we are sitting in the Cloud and AI group. This is a product group. We’re part of these cloud services and we provide company-wide and industry-wide speech and translation services. We also have speech and dialog research. They are really operating like a research group.

Host: Yeah.

Xuedong Huang: They’re all researchers in that team. As Rick has been saying, tech transfer is a full-contact sport. We are not just, you know, a full-contact sport, we’re a one-body sport. So, it’s actually a very exciting group, with a group of very talented, very innovative people.

Host: So, it’s still forward-thinking in the research mode…

Xuedong Huang: It’s both forward-thinking and well-grounded. We have to be grounded to deliver services from infrastructure to cost of serving, and we also have to be standing high to see the future, to define what is the solution that the people need and people want, even though the solution might not have existed and they may not know what it is at this moment.

(music plays)

Host: Well, let’s talk about some specific research milestones that you’ve been involved in. They are really interesting. Three areas you’ve been involved in: conversational speech recognition, machine translation and conversational Q&A. So, let’s start with the recognition. In 2016, you led a team that reached historical human parity in transcribing conversational speech. Tell us about this. What was it a part of, how did it come about?

Xuedong Huang: So, in 2016, we reached human parity on the broadly used Switchboard Conversational Transcription task. That task has been used in the research community and industry probably over ten years. In 2017, we redefined the human parity milestone, so we’re not competing with only one single person, we’re competing with a group of people to transcribe the same task. So, I would say 2017 is really a historical moment. In comparison to a group of people transcribing the same task, Microsoft Speech Stack outperformed all four teams combined together. When I challenged our research group, nobody thought that was even feasible. But in less than two years, amazingly, when we had the conviction and the resource and the focus, magic indeed happened. So, that was actually a fantastic moment for the team, for science, for the technology stack. That was the first human parity milestone for my personal professional career.

Host: So, I want to go in the weeds a little bit on this because this is interesting what you say, in two years, nobody thought it was possible and then you did it. Tell us a little more about the technical aspects of how you accomplished this.

Xuedong Huang: So, if you look at the history of speech research, the speech group pioneered many breakthroughs that got reused by others. Let’s take translation as an example. So, even for speech, in the early 70s, speech recognition used more traditional AI, like rule-based approaches, expert systems. And IBM Watson Research pioneered statistical speech recognition, using Hidden Markov Models, using, you know, statistical language models. They really pushed the envelope and advanced the field. So, that was a great moment. It was the same group of IBM speech researchers, they borrowed the same idea from speech recognition, applied that to translation. They rewrote translation history. Really advanced the quality of translation substantially. And after Hidden Markov Models, it was deep learning that started with speech recognition, neural speech recognition. And once again, translation borrowed the same thing with neural machine translation and also advanced. So, you can see translation mirroring the technology speech people pioneered. Actually, speech guys have been doing this, you know, systematic benchmarking, funded by DARPA, very rigorous evaluation, that really changed how science and engineering could be evaluated.

Host: Right.

Xuedong Huang: So, there are many broad lessons from the speech technology community that could have been used broadly, beyond speech. So, we got trained to deal with tough problems. It’s no wonder the same group of people could have achieved this historic milestone.

Host: Well, let’s talk about another human parity milestone: the automatic Chinese to English news translation for the WMT-2017 task. And I had Arul Menezes on the show to talk all about that. But I’d love your perspective on whether and how – this kind of goes back to what we talked about at the beginning – whether and how you think machines can now compare to traditional human translation services and why this work is an important breakthrough for barriers between people and cultures.

Xuedong Huang: So, the second human parity breakthrough from my team is equally exciting. As I said, transcribing Switchboard Conversational Speech is a great milestone. But it’s really at the very low level, at the perceptive AI level. Translation is a task that is between perceptive AI and cognitive AI. Of course, translation is a harder task, and nobody believed we could have achieved this. So, we set a goal: in five years, let’s see if we can achieve translation human parity on a sentence-by-sentence basis. So, I want to really put that condition here. When humans, like you and me, translate, we are looking at the whole paragraph, we have the broader context, we do a better job. So, we limited ourselves because, for the broader use, the WMT, which is just news translation measured on the sentence-by-sentence level…

Host: Um-hum.

Xuedong Huang: …and so, it’s a broadly open research, public benchmark. Even for that one, we thought it could have taken five years. So, we applied the same principle: build on the success we had on transcribing Switchboard Speech Recognition. But this time, we actually went one step beyond. We partnered with the Microsoft Research group in Beijing because it’s a Chinese to English translation. So, across the Pacific, multiple teams in Microsoft Research Asia worked together days and nights. Amazingly, this group of people surprised everyone. We delivered this in less than a year, reaching human parity, an historical translation level, better than professional people on the same task, as measured by our scientists. So, this time, really, we did something magic. I’m very proud of the team. I’m very proud of the collaboration.

Host: Well, another super interesting area that I’d love to talk about with you is what you call CoQA. And that’s C-O-Q-A. Conversational Q&A. So, obviously we’re talking about computers having this conversation with us, question and answer. Tell us about the work that’s going on in this most human, and perhaps most difficult, of tasks in speech recognition technology.

Xuedong Huang: So, this task was pioneered by Stanford researchers. It’s even one step closer to cognitive AI. This is really a machine reading comprehension task with conversation, with dialogue, about the task. Let’s say you read a paragraph. Then we challenge the reader to answer correctly a sequence of questions that are related. For example, if you read the paragraph about Bill Gates, the first question could have been, “Who is the founder of Microsoft?” The second question could be related to the first one, “How old was the person when the person started?” Or you could have said, “And when the person retired, how old was he?” So, that context relevancy is harder than simple machine reading comprehension because there’s a sequence of related questions you have to answer, given the context. So, for this latest breakthrough, and I have to give credit mostly to our colleagues in the Beijing research lab, we have been pioneering this working together using shared resources and the infrastructure. It’s just amazing. I’m so impressed with the agility and the speed we had to achieve this amazing conversational question and answering challenge. So, the leading researchers, they are all in Beijing; we play a great supporting role, helping Microsoft, once again, be the first to achieve human parity on this broadly watched AI task. Nobody believed anyone could have achieved this conversational Q&A human parity in such a short time. And so, we thought it might take two years. Once again, we broke the historical record.
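
To make the task described above concrete, here is a minimal sketch, not taken from the episode, of how a conversational Q&A example might be represented: one passage plus a running history of question and answer turns, so that a follow-up question such as "How old was he when he started?" is read together with the turn that says who "he" is. The class names, the `build_input` helper, the `max_turns` parameter and the plain-text prompt layout are all hypothetical and chosen for illustration; they are not the CoQA data format and not the system the team built.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One earlier question together with the answer that was given."""
    question: str
    answer: str


@dataclass
class Conversation:
    """A passage plus the dialogue so far (hypothetical layout)."""
    passage: str
    history: list[Turn] = field(default_factory=list)

    def build_input(self, question: str, max_turns: int = 2) -> str:
        """Join the passage, the most recent turns, and the new question.

        Keeping earlier turns in the input is what lets a reader (human
        or model) resolve references like "he" in a follow-up question.
        """
        # A slice of [-0:] would return the whole list, so guard zero.
        recent = self.history[-max_turns:] if max_turns > 0 else []
        lines = [f"passage: {self.passage}"]
        for turn in recent:
            lines.append(f"q: {turn.question}")
            lines.append(f"a: {turn.answer}")
        lines.append(f"q: {question}")
        return "\n".join(lines)


if __name__ == "__main__":
    conv = Conversation(
        passage="Bill Gates was born in 1955. He co-founded Microsoft in 1975."
    )
    conv.history.append(Turn("Who is the founder of Microsoft?", "Bill Gates"))
    # The follow-up only makes sense alongside the previous turn.
    print(conv.build_input("How old was he when he started it?"))
```

Dropping the history (for example `max_turns=0`) leaves the follow-up question with an unresolved "he", which is the extra difficulty over single-question reading comprehension that the answer above points to.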

Host: Well, we’ve talked a little bit about the more technical aspects of what you are doing and how you are doing this. So, on this last one, are there any other methodologies or techniques that you brought to the table to conquer this Q&A task?

Xuedong Huang: So, Microsoft has accumulated thirty years of research and experiences in AI, right? The natural language group in Beijing, they have been doing this in the last twenty years and they have accumulated lots of talents, a lot of experiences. And we basically use deep learning and transfer learning. Also, we built our success on top of the whole community.

Host: Mm-hmm.

Xuedong Huang: For example, Google, they delivered this fascinating technology called BERT. And…

Host: Is that an acronym?

Xuedong Huang: Yes, it’s an acronym. It’s embedding technology. We built the success on top of that, expanded that. That’s how we achieved the human parity breakthrough.

Host: Mm-hmm.

Xuedong Huang: So, it’s really a reflection of the collective community. And I talked about the collaboration between Microsoft Research in Asia and our team in the US. Actually, this is a great example of collaboration of the whole industry.

(music plays)

Host: On the heels of everything that could possibly go right – and it’s pretty exciting what you’ve described to us in this podcast – we do have to address what could possibly go wrong, if you are successful.

Xuedong Huang: Mm-hmm.

Host: You want to enable computers to listen, hear, speak, translate, answer questions – basically, communicate – with people. Does anything about that keep you up at night?

Xuedong Huang: Yes, absolutely. My worry is really, someday, humans can be too dependent on AI. And AI will never be perfect. AI would have a unique sort of biases. So, I worry about that unconscious influence.

Host: Right.

Xuedong Huang: So, how to deal with that is really a broad societal issue that we have to be aware of and we have to address. Because just like anyone, if you have an assistant you depend on, you absolutely know how much that assistant can influence you, change your agenda, change your opinion. And AI, one day, is going to play the same role. AI will be biased. And how we deal with that is my top concern.

Host: Yeah.

Xuedong Huang: If everything goes well. That is really, you know, a top issue we have to deal with. We have to learn how to deal with it. We do not know because we are not there yet.

Host: So, what kinds of “design thinking” are you bringing to this as you build these tools that can speak and listen and converse, because one of the biggest things is that human ability to impute human qualities to something that’s not human…

Xuedong Huang: I think just, you know, there are enough responsible people working on AI. And the good news is that we’re not there yet, right? So, we have time to work together to deal with that and make sure AI is going to really serve mankind, not to destroy mankind. So that’s my top worry…

Host: Yeah.

Xuedong Huang: …what keeps me awake. But my short-term worry is really AI is not good enough! Not yet!

Host: Okay.

Xuedong Huang: And people, as Bill Gates used to say, you always overestimate what you can do in the short-term and underestimate the impact in the long-term. For this case, we cannot underestimate the long-term impact.

Host: Right.

Xuedong Huang: The long-term milestone.

Host: Okay. It’s story time.

Xuedong Huang: Mmmm. Good!

Host: Tell us a bit about your life. What’s your story? What got you interested in research, particularly the speech and language technology research, and what was your path to MSR?

Xuedong Huang: Good. Um, I was a graduate student at Tsinghua University in Beijing. At that time, my first computer was an Apple II. And, you know, the Chinese language is not easy to type. So, it was very cumbersome. So, that necessity brought me to speech recognition. My dream at that time was, as a graduate student in Tsinghua, actually was in AI. In AI of Tsinghua’s, you know, graduate school…

Host: Yeah.

Xuedong Huang: …was fantastic to have, you know, so many professors and faculty members who had that long-term vision and set up the pioneering environment for us to explore and experiment with. So, I finished my master’s degree. I was in the PhD program and I had been working on speech recognition since ’82 because I was enrolled, admitted, to Tsinghua in 1982. That dream, to make it easier for people to really communicate with machines, never disappeared. So, I have been working on this for over thirty years. Even though, at Microsoft, for a short period of time, I stepped out of speech, I was still doing something related. So, I really thought this was a fascinating story. And I’ve got a really interesting personal story. As I said, you know, it was hard to type in Chinese when I was at Tsinghua University. And I didn’t finish my PhD at Tsinghua. I went to the University of Edinburgh…

Host: Okay.

Xuedong Huang: …in Scotland. And I did finish my PhD there. But my personal pain point when I first landed in Edinburgh was really – I learned English, mostly American English, in China. It wasn’t that good because it wasn’t my native language. But listening to a Scottish professor…

Host: Oh, my goodness!

Xuedong Huang: …talking was always challenging. But I was so grateful BBC had closed captioning.

Host: Oh, funny.

Xuedong Huang: So, I really learned my Scottish English from watching BBC. And I have to say, that automatic captioning technology is available on Microsoft PowerPoint today. And that journey from personal pain points to what we and the Office PowerPoint teams can bring together is fascinating and personally extremely rewarding.

Host: Yeah.

Xuedong Huang: I’m so grateful to see the technology I have worked on is going to help many other people who are attending Scottish universities!

Host: You know, Arul talked about that PowerPoint…

Xuedong Huang: Yeah.

Host: …service and he was talking about people who had hearing disabilities.

Xuedong Huang: Mm-hmm.

Host: You give it a whole new…

Xuedong Huang: It’s much broader…

Host: Oh, absolutely!

Xuedong Huang: …because the language barrier is always there. Not everyone is as fluent. And I host many visitors. Almost every year I’m hosting Tsinghua University MBA students and they all learn English, but their ability to converse and listen simply is not as good as native speakers here. So, the simple fact that we were able to provide captioning on the PowerPoint presentation helped all of them…

Host: Yeah.

Xuedong Huang: …to learn and understand much better. So, this is actually a fairly broad scenario without even translating. Just the fact you have captioning, we enhance the communication.

Host: Right. And you know, we talked earlier about the different languages and we talked a little bit about dialects, but we didn’t really talk about accents within language. I mean, even in the United States, you go to various parts of the country and have a more difficult time understanding, even from your own country, just because of the accent.

Xuedong Huang: That’s why my Scottish English is a good story! And I hope I still have a little bit of Scottish accent!

Host: I hear it! Well at the end of every podcast, I give my guests the last word. And since you are in human language technologies, it’s particularly apropos for you. Now’s your chance to say whatever you want to our listeners who might be interested in enabling computers to converse and communicate. What ought they to put boots on for?

Xuedong Huang: Working on speech and language! This is really the crown jewel of AI. You know, there’s no more challenging task than this one, in my opinion. Especially if you want to move from perceptive AI to cognitive AI. To get the ability to reason, to understand, to acquire knowledge by reading, by conversing, is just, you know, such a fundamental area that can improve everyone’s life, improve everyone’s productivity, make this world a much better place without language barriers, without the communication barriers, without understanding barriers.

Host: Xuedong Huang, thank you for joining us on the podcast today. It’s been fantastic.

Xuedong Huang: My pleasure.

(music plays)

To learn more about Dr. Xuedong Huang and the science of machine speech and language, visit
