Jianfeng Gao on the Microsoft Research Podcast

Episode 104 | January 29, 2020

Dr. Jianfeng Gao is a veteran computer scientist, an IEEE Fellow and the current head of the Deep Learning Group at Microsoft Research. He and his team are exploring novel approaches to advancing the state-of-the-art on deep learning in areas like NLP, computer vision, multi-modal intelligence and conversational AI.

Today, Dr. Gao gives us an overview of the deep learning landscape and talks about his latest work on Multi-task Deep Neural Networks, Unified Language Modeling and vision-language pre-training. He also unpacks the science behind task-oriented dialog systems as well as social chatbots like Microsoft Xiaoice, and gives us some great book recommendations along the way!



Jianfeng Gao: Historically, there are two approaches to achieve the goal. One is to use large data. The idea is that if I can collect all the data in the world, then I believe the representation learned from this data is universal. Because I see all of them. The other approach is that, since the goal of this representation is to serve different applications, how about I train the model using application-specific objective functions across many, many different applications?

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: Dr. Jianfeng Gao is a veteran computer scientist, an IEEE Fellow and the current head of the Deep Learning Group at Microsoft Research. He and his team are exploring novel approaches to advancing the state-of-the-art on deep learning in areas like NLP, computer vision, multi-modal intelligence and conversational AI.

Today, Dr. Gao gives us an overview of the deep learning landscape and talks about his latest work on Multi-task Deep Neural Networks, Unified Language Modeling and vision-language pre-training. He also unpacks the science behind task-oriented dialog systems as well as social chatbots like Microsoft Xiaoice, and gives us some great book recommendations along the way! That and much more on this episode of the Microsoft Research Podcast.

Host: Jianfeng Gao, welcome to the podcast.

Jianfeng Gao: Thank you.

Host: So you’re a Partner Research Manager of the Deep Learning Group at MSR. What’s your big goal as a researcher yourself, and what’s the big goal of your group? What gets you up in the morning?

Jianfeng Gao: It’s like all the world-class research teams, our goal, ultimate goal, is to advance the state-of-the-art and we want to push the AI frontiers by using deep learning technology or developing new deep learning technologies. That’s the goal I think every group has.

Host: Right.

Jianfeng Gao: But for us, because we are a group at Microsoft, we also have a mission to transfer the latest deep learning and AI technologies into Microsoft products so that we can benefit millions of Microsoft users.

Host: Well, interestingly, as you talk about the Deep Learning Group, as I understand it, that’s a relatively new group here at Microsoft Research, but deep learning is not a new thing here so tell us how and why this group actually came about?

Jianfeng Gao: Yeah, deep learning has a long history. Actually, the first deep learning model, at that time it was called a neural net model, was developed a half century ago.

Host: Right.

Jianfeng Gao: But at the time, because the training data for large-scale model learning was not available, the performance of these neural net models was not as good as the state-of-the-art models at that time.

Host: Okay.

Jianfeng Gao: So deep learning only, I think, took off in the last decade when the large amounts of training data is available and the large-scale training infrastructure, computing training infrastructure, is available.

Host: Okay.

Jianfeng Gao: At Microsoft, deep learning also has a long history. I remember back to 2012, the speech group at Microsoft Research already demonstrated the power of deep learning by applying them to acoustic modeling. They were able to reduce the error rate of the speech recognition system by about ten percent to fifteen percent. That was considered a very significant milestone.

Host: Right.

Jianfeng Gao: That came after almost ten years’ hard work without any significant improvement, until deep learning hit the bar. So then in two years, the vision team, computer vision team at Microsoft, developed an extremely deep model called ResNet and they reached human parity and won a lot of competitions. And I think the first deep learning group at Microsoft Research was founded back in 2014. At that time, our focus was to develop new deep learning technologies for natural language processing and web search and a lot of business applications. In the beginning, we thought that deep learning can not only be used to push the frontier of AI, but also to benefit Microsoft products. So there are two parts in the deep learning group. One is the research part. The other is the incubation part.

Host: Okay.

Jianfeng Gao: I was managing the incubation part and then Dr. Li Deng was managing the research part. Then after two or three years, the incubation starts to show very promising business results internally, so they moved the team to an independent business incubation division. Then, in some sense, the big deep learning team is split into two parts. Then, later on, they moved my team to Dynamics asking me to build real products for customers. And at that time I had to make a choice so I either stay there to be a general manager of the new product team or move back to MSR. So I decided to move back last year. So last year we built a new deep learning group. This is probably the biggest research team at MSR AI.

Host: Talk a little bit more granularly about deep learning itself and how your particular career has ebbed and flowed in the deep learning world.

Jianfeng Gao: I joined Microsoft almost twenty years ago. The speech group was my first team. I worked on speech, then I worked on natural language processing, web search, machine translations, statistical machine learning and even, you know, intelligent sales and marketing. But I touched deep learning back in 2012 when Li Deng introduced to me the speech deep learning model. At that time I remember he was super excited and ran into my office saying, oh, we should build a deep learning model for natural language processing. I said, oh, I don’t believe that. But anyway, we tried it. The first deep learning model we developed is called DSSM. Stands for Deep Structured Semantic Model. The idea is very simple. We take the web search scenario as a test case. The idea is that you have a query, you want to identify relevant documents, but unfortunately, the documents are written by the author. Queries are issued by the users using very, very different vocabulary and language. There’s a mismatch. So the deep learning idea is to map both query and document into a common vector space we call semantic space. In that space all these concepts are represented using vectors and the distance between vectors measures the semantic similarity. The idea is very straightforward. Fortunately, we got a lot of Bing click data. Users issue a query and they click a document.

Host: Right.

Jianfeng Gao: These are weak supervision training data. We have tons of this and then we train the deep learning model called DSSM. It’s fantastic. Encouraged by this result, we decided to form a deep learning team. The key concept of deep learning is representation learning. You know, let’s take natural language as an example, okay? Let’s say a natural language sentence consists of words and phrases. These are symbolic tokens. The good thing about these symbolic tokens is that people can understand them easily. But they are discrete. Meaning that, if you are given two words, you want to ask a question: how similar are they? Deep learning is trying to map all these words into semantic representations so that you can measure the semantic similarity. And this mapping is done through a non-linear function, and the deep learning model, in some sense, is an implementation of this non-linear function.
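
(Editor's sketch, not spoken in the episode.) The idea of placing queries and documents in one vector space and reading similarity off the vectors can be illustrated roughly as follows. This is a minimal sketch, assuming cosine similarity as the closeness measure; the three-dimensional vectors, the document ids and the `rank_documents` helper are hypothetical and hand-picked for illustration. In a trained model such as DSSM the vectors would come from networks learned on click data, which is not shown here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical, hand-picked "semantic" vectors (assumption: not learned).
# A trained model would produce these from the raw query and document text.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "doc_airfare_deals": [0.8, 0.2, 0.1],
    "doc_cooking_tips": [0.0, 0.1, 0.9],
}

def rank_documents(query, docs):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)

ranking = rank_documents(query_vec, doc_vecs)
```

Here `ranking` lists the document whose vector points in nearly the same direction as the query first, even though no words are compared directly.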

Host: Okay.

Jianfeng Gao: And it’s a very effective implementation, in the sense that you can add more, more, more layers, make them very deep, and you have a different model architecture to capture different aspects of the input and even identify the features at a different abstract level. Then this model needs large amounts of data to train. I think a half century ago, we don’t have the compute power to do this. Now we have. And we also have large amounts of training data for this.

Host: Yeah.

Jianfeng Gao: That’s why I think deep learning took off.

Host: Okay. Well, let’s talk a little bit about these representations and some of the latest research that’s going on today. In terms of the kinds of representations you’re dealing with, we’ve been talking about symbolic representations, both in language and mathematics, and you’re moving into a space where you’re dealing more with neural representations. And those two things – that architecture is going to kind of set the stage for the work that we’re going to talk about in a minute, but I would like you to talk a little bit about both the definitions of symbolic representations and neural representations and why these neural representations represent an interesting, and possibly fruitful, line of research?

Jianfeng Gao: Let’s talk about two different spaces. One is called symbolic space. The other is the neural space. They have different characteristics. The symbolic space, take natural language as an example, is what we are familiar with, where the concepts are represented using words, phrases and sentences. These are discrete. The problem of this space is that natural language is highly ambiguous, so the same concept can be represented using very different words and phrases. And the same words or sentence can mean two or three different things given the context, but in the symbolic space it’s hard to tell.

Host: Yeah.

Jianfeng Gao: In the neural space it’s different. All the concepts are going to be represented using vectors, and the distance between vectors measures the relationship at the semantic level. So we already talked about representation learning, which is the major task of deep learning.

Host: Yeah.

Jianfeng Gao: Deep learning, in some sense, is to map all the knowledge from the symbolic space to neural space because in the neural space, all the concepts are represented using continuous vectors. It’s a continuous space. It has a lot of very nice math properties. It’s very easy to train. That’s why, if you have a large amount of data and you want to train a highly non-linear function, it’s much easier to do so in the neural space than in the symbolic space, but the disadvantage of the neural space is it’s not human comprehensible. Because if I give you, say, okay, these two concepts are similar because the vectors of their representation are close to each other. How close are they? I don’t know. It’s hard to explain!

Host: It’s uninterpretable.

Jianfeng Gao: It’s not interpretable. At all. That’s why people believe that the neural net model is like a black box.

Host: Okay.

Jianfeng Gao: It can give you very precise prediction, but it’s hard to explain how the model came up with the prediction. This applies to some tasks like image recognition. A deep learning model does a great job on tasks like this, but take a different task, like a math task. If I give you a problem statement like, let’s say the population of a city is five thousand, it increases by ten percent every year. What’s the population after ten years? The deep learning model would try to just map this text into a number without knowing how the number is arrived at, but in this particular case, we need neural symbolic computing. Ideally, you need to identify how many steps you need to take to generate the result. And for each step, what are the functions? So this is a much tougher task.
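
(Editor's sketch, not spoken in the episode.) The population problem can be worked step by step, which is the kind of explicit intermediate reasoning being described. The function name `compound_growth_steps` is hypothetical; the numbers (five thousand, ten percent, ten years) come from the example in the conversation, and the sketch assumes the ten percent compounds annually.

```python
def compound_growth_steps(initial, rate, years):
    """Return the population after each year, keeping every intermediate step."""
    steps = [initial]
    for _ in range(years):
        # One explicit step: next year's population is this year's plus 10%.
        steps.append(steps[-1] * (1 + rate))
    return steps

steps = compound_growth_steps(5000, 0.10, 10)
final_population = steps[-1]  # equals 5000 * 1.1**10, roughly 12,969
```

The list `steps` holds the eleven values from year zero to year ten, so the answer comes with the chain of operations that produced it rather than as a single opaque number.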

Host: Right.

Jianfeng Gao: I don’t think the current deep learning model can solve.

Host: All right, so, but that is something you’re working on?

Jianfeng Gao: Yes.

Host: You’re trying to figure out how you can move from symbolic representations to neural representations and also have them be interpretable?

Jianfeng Gao: Yes, exactly.

Host: Big task.

Jianfeng Gao: Yeah, yeah. There’s a book called Thinking, Fast and Slow. In that book it also describes two different systems that drive the way we think. They call this System One and System Two. System One is like very intuitive, fast and emotional. So you ask me something. I don’t need to think. I give you an answer immediately because I already answered similar questions many, many times.

Host: Right.

Jianfeng Gao: System Two is slower, more logical, more deliberative. It’s like you need some reasoning such as the question I just asked, right, the math problem of the population of the city. You need to think harder. I think most of the state-of-the-art deep learning models are like System One. It trains on large amounts of training data. Each training example is an input-output pair. So the model learns the mapping between input and output by fitting a non-linear function on the data. That’s it. Without knowing how exactly the result is generated, but now we are working on, in some sense, System Two. That’s neural symbolic. You not only need to generate an answer, but also need to figure out the intermediate steps you follow to generate the answer.

(music plays)

Host: Your group has several areas of research interest and I want you to be our tour guide today and take us on a couple of excursions to explore these areas. And let’s start with an area called neural language modeling. So talk about some promising projects and lines of inquiry, particularly as they relate to neural symbolic reasoning and computing.

Jianfeng Gao: Neural language model is not a new topic. It’s been there for many years. Only recently Google proposed a neural language model called BERT. It achieves state-of-the-art results on many NLP tasks because they use a new neural network architecture called a transformer. So the idea of this model is representation learning. Whatever text they take, they will represent using vectors. And we are working on the same problem, but we are taking a different approach. So we also want to learn representations and then try to make the representations as universal as possible in the sense that the same representation can be used by many different applications. Historically, there are two approaches to achieve the goal. One is to use large data. The idea is that if I can collect all the data in the world, then I believe the representation learned from this data is universal. Because I see all of them. The other approach is that, since the goal of this representation is to serve different applications, how about I train the model using application-specific objective functions across many, many different applications? So this is called multi-task learning. So Microsoft Research is taking the multi-task learning approach. So we have two models, called MT-DNN and Unified Language Model.

Host: So that’s MT-DNN, so multi-task…?

Jianfeng Gao: Stands for Multi-Task Deep Neural Network. They, for those two models, the multi-task learning is applied at a different stage. The pre-training stage and the fine-tuning stage. Yeah. So this is the neural language model part.
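
(Editor's sketch, not spoken in the episode.) The multi-task idea, training one shared representation against several application-specific objectives at once, can be illustrated with a deliberately tiny example. This is not the MT-DNN architecture: the single scalar `w_shared`, the two fixed task heads, the toy data and the numerical gradient are all assumptions chosen to keep the sketch short. What it does show is that the update to the shared parameter is driven by the sum of the per-task losses.

```python
def shared_encoder(x, w_shared):
    """Hypothetical one-parameter 'representation' used by every task."""
    return w_shared * x

def task_loss(x, y, w_shared, w_head):
    """Squared error of one task head applied on top of the shared encoder."""
    pred = w_head * shared_encoder(x, w_shared)
    return (pred - y) ** 2

def joint_loss(batches, w_shared, heads):
    """Sum of per-task losses; every task reads the same shared encoder."""
    total = 0.0
    for task, examples in batches.items():
        for x, y in examples:
            total += task_loss(x, y, w_shared, heads[task])
    return total

def shared_gradient(batches, w_shared, heads, eps=1e-6):
    """Central-difference gradient of the joint loss w.r.t. the shared weight."""
    up = joint_loss(batches, w_shared + eps, heads)
    down = joint_loss(batches, w_shared - eps, heads)
    return (up - down) / (2 * eps)

# Toy data for two tasks (assumption). Task heads are kept fixed for brevity.
batches = {
    "task_a": [(1.0, 2.0), (2.0, 4.0)],
    "task_b": [(1.0, -2.0), (3.0, -6.0)],
}
heads = {"task_a": 1.0, "task_b": -1.0}

w_shared = 0.5
loss_before = joint_loss(batches, w_shared, heads)
for _ in range(50):
    # Gradient descent on the shared weight using signal from both tasks.
    w_shared -= 0.01 * shared_gradient(batches, w_shared, heads)
loss_after = joint_loss(batches, w_shared, heads)
```

In this toy setup both tasks are satisfied by the same shared weight, so the joint loss falls as `w_shared` moves toward that value. Real systems also train the task heads and use far richer encoders.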

Host: Okay.

Jianfeng Gao: But mainly I would say this is still like System One.

Host: Still back to the thinking fast?

Jianfeng Gao: Yeah, thinking fast. Fast thinking…

Host: Gotcha. That’s a good anchor. Well, let’s talk about an important line of work that you’re tackling and it falls under the umbrella of vision and language. You call it VL.

Jianfeng Gao: Uh-huh. Vision-language.

Host: Vision-language. Give us a snapshot of the current VL landscape in terms of progress in the field and then tell us what you’re doing to advance the state-of-the-art.

Jianfeng Gao: This is called vision-language. The idea is the same. We still learn the representation. Now we are learning a hidden semantic space where all the objects would be represented as vectors no matter the original media of the object. It could be a text. It could be an image. It could be a video. So, remember we talked about the representation learning for natural language?

Host: Right.

Jianfeng Gao: Now we extend the concept. Extend the modality for natural language to multi-modality to handle natural language, vision and video. The idea is, okay, give me a video or image or text, I will represent them using vectors.

Host: Okay.

Jianfeng Gao: By doing so, if we do it correctly, then this leads to many, many interesting applications. For example, you can do image search. You just put a query. I want an image of sleeping. It will return all these images. See that’s cross modality because the query is in natural language and the return result is an image. And you can also do image captioning, for example.

Host: Okay.

Jianfeng Gao: It can be an image.

Host: Right.

Jianfeng Gao: And the system will generate a description of the image automatically. This is very useful for, let’s say, blind people.

Host: Yeah.

Jianfeng Gao: Yeah.

Host: Well, help me think though, about other applications.

Jianfeng Gao: Other applications, as I said…

Host: Yeah.

Jianfeng Gao: …for blind people, we have a big project called the Seeing AI.

Host: Right.

Jianfeng Gao: The idea is, let’s say if you are blind, you’re walking on the street and you’re wearing a glass. The glass would take pictures of the surroundings for you and immediately tell you, oh, there’s a car, there’s a boy…

Host: So captioning audio?

Jianfeng Gao: Audio. And tell you what happens around you. Another project we are working on is called Visual Language Navigation. The idea is we build a 3D environment. It’s a simulation, but it’s a 3D environment. And they put a robot there. It’s an agent. And you can ask the agent to achieve a task by giving the agent natural language instructions: okay, go upstairs, turn left, open the door, grab a cup of coffee for me. Something like that. This is going to be very, very useful for scenarios like mixed-reality, and HoloLens.

Host: I was just going to say, you must be working with a lot of the researchers in VR and AR.

Jianfeng Gao: Yes. These are sort of potential applications, but we are at the early stage of developing this core technology in the simulated environment.

Host: Right. So you’re upstream in the VL category and as it trickles down into the various other applications people can adapt the technology to what they’re working on.

Jianfeng Gao: Exactly.

Host: Let’s talk about the third area, and I think this is one of the most fascinating right now, and that’s Conversational AI. I’ve had a couple people on the podcast already who’ve talked a little bit about this. Riham Mansour and Patrice Simard, who’s head of the Machine Teaching Group.

Jianfeng Gao: Yeah.

Host: But I’d like you to tell us about your work on the neural approaches to Conversational AI and how they’re instantiating in the form of question answering agents, task oriented dialog systems, or what we might call bespoke AI, and bots… chatbots.

Jianfeng Gao: Yeah, these are all obviously different types of dialogs. Social chatbots is extremely interesting. Do you know Microsoft Xiaoice?

Host: I know of it.

Jianfeng Gao: Yeah, it’s a very popular social chatbot, it has attracted more than six hundred million users.

Host: And is this in China or worldwide?

Jianfeng Gao: It’s deployed in five different countries. So it has Chinese version, it has Japanese version, English version. It does have five different languages.

Host: Wow.

Jianfeng Gao: Yeah, it’s very interesting.

Host: Do you have it?

Jianfeng Gao: I have it on my WeChat.

Host: All right, so tell me about it.

Jianfeng Gao: Yeah, this is AI agent, but the design goal of this social chatbot is different from let’s say task-oriented bot. Task oriented is mainly to help you accomplish a particular task. For example, you can use it to book a movie ticket, reserve a table in the restaurant…

Host: Get directions…

Jianfeng Gao: Yeah, get directions. And the social chatbot is designed as an AI companion, which can eventually establish emotional connections with the user.

Host: Wow.

Jianfeng Gao: So you can treat it as a friend, as your friend.

Host: So an AI friend instead of an imaginary friend.

Jianfeng Gao: Yes, it’s an AI friend. It can chat with you about all sorts of topics. It can also help you accomplish your tasks if they’re simple enough.

Host: Right now I want to dive a little deeper on the topic of neural symbolic AI and this is proposing an approach to AI that borrows from mathematical theory on how the human brain encodes and processes symbols. And we’ve talked about it a little bit, but what are you hoping that you’ll accomplish with neural symbolic AI that we aren’t accomplishing now?

Jianfeng Gao: As I said, the key difference between this approach with just the regular deep learning model is the capability of reasoning. The deep learning model is like black box you cannot open. So you take input and get output. This model can, on-the-fly, identify the necessary components and assemble them on-the-fly. That’s the key difference. In the old deep learning model, it’s just one model: black box. Now it’s not a black box. It’s actually exactly like what people are thinking.

Host: Mmm-hmm.

Jianfeng Gao: When you face a problem, first of all you divide and conquer, right? You divide a complex problem into smaller ones. Then, for each smaller one you identify, you’re searching your memory, identify the solution. And you assemble all these solutions together to solve a problem. This problem could be unseen before. It could be a new problem.

Host: Right.

Jianfeng Gao: That’s the power of the neural symbolic approach.

Host: So it sounds like, and I think this kind of goes back to the mission statement of your group, is that you are working with deep learning toward artificial general intelligence?

Jianfeng Gao: Yeah. This is a very significant step toward that, and it’s about the knowledge re-usability, right? By learning the capability of decomposing complex problem into simpler ones, you know how to solve a new complex problem and reuse the existing technologies. This is the way we solve that problem.

Host: Okay.

Jianfeng Gao: I think the neural symbolic approach tries to mimic the way people solve problems.

Host: Right.

Jianfeng Gao: People… as I said, it’s like System One, System Two… For these sophisticated problems, people’s system is like System Two.

Host: Right.

Jianfeng Gao: You need to analyze the problem, identify the key steps, and then, for each step, identify the solution.

Host: All right, so our audience is very technical and I don’t know if you could go in to a bit of a deeper dive on how you’re doing this – computationally, mathematically – to construct these neural symbolic architectures?

Jianfeng Gao: Yeah, there are many different ways, and the learning challenge is that we have a lot of data, but we don’t have the labels for the intermediate steps. So the model needs to learn these intermediate steps automatically. In some sense, these are hidden variables. There are many different ways of learning this.

Host: Right.

Jianfeng Gao: So there are different approaches. One approach is called reinforcement learning. You try to assemble different ways to generate an answer and if it doesn’t give you an answer, you trace back and try different combinations. So yeah, that’s one way of learning this. As long as the model has the capability of learning all sorts of combinations in very efficient ways, we can solve this problem. The idea is, if you think about how people solve sophisticated problems, when we’re young, we learn to solve these simple problems. Then we learn the skill. Then we combine these basic skills to solve more sophisticated ones. We try to mimic the human learning pattern using the neural symbolic models.
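
(Editor's sketch, not spoken in the episode.) The notion of trying different combinations of basic skills until one produces the right answers, with only the final answers labelled and the intermediate steps hidden, can be illustrated with a brute-force search. This is plain enumeration, not reinforcement learning, and the three skills, the example pairs and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical "basic skills" the system already knows how to perform.
SKILLS = {
    "add_one": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}

def run_program(program, x):
    """Apply a sequence of skill names to x, one after another."""
    for name in program:
        x = SKILLS[name](x)
    return x

def find_program(examples, max_steps=3):
    """Try every sequence of skills, shortest first, until one fits all examples."""
    for length in range(1, max_steps + 1):
        for program in product(SKILLS, repeat=length):
            if all(run_program(program, x) == y for x, y in examples):
                return list(program)
    return None  # no combination within max_steps explains the examples

# Only inputs and final answers are given; the intermediate steps are not.
examples = [(1, 16), (2, 36)]
program = find_program(examples)
```

The recovered `program` is a readable chain of steps that can be reused on inputs it has not seen, for instance `run_program(program, 3)`. Exhaustive search scales poorly as the number of skills and steps grows, which is one reason learned search strategies are of interest.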

Host: Mmm-hmm.

Jianfeng Gao: So in that case, you don’t need to label a lot of data. You label some. Eventually, the model learns two things. One is, it learns to solve all these basic tasks, and more importantly, the model is going to learn how to assemble these basic skills to solve more sophisticated tasks.

Host: The idea of pre-training models is getting a lot of attention right now and has been framed as “AI in the big leagues” or “a new AI paradigm” so talk about the work going on across the industry in pre-trained models and what MSR is bringing to the game.

Jianfeng Gao: The goal of these pre-training models is to learn a universal representation of the natural language. Then there are two strategies of learning to the universal representation. One is to train the model on large amounts of data. If you get all the data in the world you can be pretty sure the model trained is universal.

Host: Right.

Jianfeng Gao: The other is multi-task learning. And the Unified Language Model is using the multi-task learning in the pre-training stage.

Host: Okay.

Jianfeng Gao: We group the language model into three different categories. Given the left and right to predict the word in the middle, that’s one task. The other task is, given the input sentence, produce the output sentence. Second. The third tasks is, given a sequence, you always want to predict the next word based on the history. So these are three very different tasks that cover a lot of natural language processing scenarios. And we use multi-task learning for this Unified Language Model. Given the training data we, you know, use three different objective functions to learn jointly…

Host: Okay.

Jianfeng Gao: …the model parameters. The main advantage of the Unified Language Model is that it can be applied to both natural language understanding tasks and the natural language generation tasks.
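
(Editor's sketch, not spoken in the episode.) The three tasks described above differ in which context a position is allowed to see when predicting a word. One way to express that, and as far as the editor is aware the formulation used in the published Unified Language Model work, is a self-attention mask per task. The mechanism is not described in the conversation, so treat it as an assumption here. The function names are hypothetical; a 1 at row i, column j means position i may look at position j.

```python
def bidirectional_mask(n):
    """Every position may attend to every position (use left and right context)."""
    return [[1] * n for _ in range(n)]

def left_to_right_mask(n):
    """Position i may attend only to positions 0..i (predict from the history)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def seq2seq_mask(src_len, tgt_len):
    """Source sees the whole source; target sees the source plus earlier targets."""
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                mask[i][j] = 1  # every position can read the input sentence
            elif i >= src_len and j <= i:
                mask[i][j] = 1  # output position reads itself and earlier outputs
    return mask

example = seq2seq_mask(2, 2)
```

Because only the mask changes, the same set of model parameters can be trained on all three objectives jointly.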

(music plays)

Host: AI is arguably the most powerful technology to emerge in the last century and it’s becoming ubiquitous in this century. Given the nature of the work you do, and the potential to cause big disruptions both in technology and in the culture, or society, is there anything that keeps you up at night? And if so, how are you working to anticipate and mitigate the negative consequences that might result from any of the work you’re putting out?

Jianfeng Gao: Yeah, there are a lot of open questions. Especially, at Microsoft, we are building AI products for millions of users, right? All our users are very different. Take Microsoft Xiaoice, the chatbot system, as an example. In order to, you know, have a very engaging conversation, sometimes the Xiaoice system will tell you some joke. You may find the joke very interesting, funny, but other people may find the joke offensive. That’s about culture. It’s very difficult to find the trade-off. You want the conversation interesting enough so that you can engage with the people, but you also don’t want to offend people. So there are a lot of guidance about who is in control. For example, if you want to switch a topic, do you allow your agent to switch a topic or agent always follow the topic…

Host: Of the user…

Jianfeng Gao: …of the user? And generally, people agree that, for all the human/machine systems, human needs to be in control all the time. But in reality there are a lot of exceptions for what happens if the agent notices the user is going to hurt herself.

Host: Right.

Jianfeng Gao: For example, in one situation, we found that the user talked to Xiaoice for seven hours. It’s already 2 am in the morning. Xiaoice forced the user to take a break. We have a lot of, sort of, rules embedded into the system to make sure that we build a system for good. People are not going to misuse the AI technology for something that is not good.

Host: So are those, like you say, you’re actually building those kinds of things in like, go to bed. It’s past your bedtime…?

Jianfeng Gao: Mmm-hmm. Or something like that, yeah. I just remind you.

Host: Right. So let’s drill in a little on this topic just because I think one of the things that we think of when we think of dystopic manifestations of a technology that could convince us that it’s human… Where does the psychological…

Jianfeng Gao: I…. I think the entire research community is working together to set up some rules, to set up the right expectations for our users. For example, one rule I think, I believe is true, is that you should never confuse users. She’s talking to a bot… or real human. You should never confuse users.

Host: Forget about Xiaoice for now and just talk about the other stuff you’re working on. Are there any, sort of, big issues in your mind that don’t have to do with, you know, users being too long with a chatbot or whatever, but kinds of unintended consequences that might occur from any of the other work?

Jianfeng Gao: Well, for example, with respect to the deep learning model, right?

Host: Right.

Jianfeng Gao: A deep learning model is a very powerful tool for predicting things. People use deep learning models for recommendations all the time, but there’s a very serious limitation of these models, which is that the model can learn correlation, but not causation. For example, if I want to hire a software developer, then I’ve got a lot of candidates. I ask the system to give me a recommendation. The deep learning model gives me a recommendation, and says, oh, this guy’s good. And then I ask the system, why? Because the candidate is a male. And people say, your system is wrong; it’s biased. But actually, the system is not wrong. The way we use the system is wrong. Because the system learns the strong correlation between the gender and the job title, but there’s no causality. The system does not have the causality at all. A famous example is, you know, there’s a strong correlation between the rooster’s crow and the sunrise, but it does not cause the sunrise at all! These are the problems of these deep learning models. People need to be aware of the limitations of the models so that they do not misuse them.
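
(Editor's sketch, not spoken in the episode.) The rooster example can be simulated to show the gap between a correlation seen in data and what happens under intervention. The toy world below is an assumption: a hidden variable `dawn` drives both the sunrise and the crow, and the crow influences nothing. All names and the 50% dawn rate are hypothetical.

```python
import random

def simulate_observation(rng, silence_rooster=False):
    """Dawn causes both the sunrise and the crow; the crow causes nothing."""
    dawn = rng.random() < 0.5          # hidden common cause (assumed rate)
    sunrise = dawn
    crow = dawn and not silence_rooster
    return crow, sunrise

def sunrise_rate(n, silence_rooster, seed=0):
    """Fraction of observations with a sunrise, with or without the intervention."""
    rng = random.Random(seed)
    outcomes = [simulate_observation(rng, silence_rooster) for _ in range(n)]
    return sum(sunrise for _, sunrise in outcomes) / n

def sunrise_rate_given_crow(n, seed=0):
    """Among observations where the rooster crowed, how often the sun rose."""
    rng = random.Random(seed)
    outcomes = [simulate_observation(rng) for _ in range(n)]
    after_crow = [sunrise for crow, sunrise in outcomes if crow]
    return sum(after_crow) / len(after_crow)

observed = sunrise_rate_given_crow(1000)                  # perfect association
with_rooster = sunrise_rate(1000, silence_rooster=False)
without_rooster = sunrise_rate(1000, silence_rooster=True)
```

A purely predictive model would rightly learn that a crow predicts a sunrise, since `observed` is 1.0. Yet silencing the rooster leaves the sunrise rate unchanged, which is the distinction between correlation and causation drawn above.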

Host: So one step further, are there ways that you can move towards causality?

Jianfeng Gao: Yes, there are a lot of ongoing works. There’s a recent book called The Book of Why.

Host: The Book of Why.

Jianfeng Gao: Yeah, The Book of Why by Judea Pearl. There are a lot of new models he’s developing. One of the popular models is called the Bayesian network. Of course, the Bayesian network can be used in many applications, but he believes this at least is a promising tool to implement the causal models.

Host: I’m getting a reading list from this podcast! It’s awesome. Well, we’ve talked about your professional path, Jianfeng. Tell us a little bit about your personal history. Where’d you grow up? Where did you get interested in computer science and how did you end up in AI research?

Jianfeng Gao: I was born in Shanghai. I grew up in Shanghai and I studied design back to college. So I was not a computer science student at all. I learned to program only because I want to date a girl at that time. So I needed money!

Host: You learned to code so you could date a girl… I love it!

Jianfeng Gao: Then, when I was graduating in the year 1999, Microsoft Research founded a lab in China and I sent them my resume and I got a chance to interview and they accepted my application. That’s it. Now, after that, I started to work on AI. Before that, I knew little about AI.

Host: Okay, back up a little. What was your degree in? Design?

Jianfeng Gao: I got undergraduate in design. Bachelor degree in design. Then I got electronic… I got a Double E.

Host: So electronic engineering?

Jianfeng Gao: Yeah, then computer science a little bit later because I got interested in computer science after… Finally I got a computer science degree.

Host: A PhD?

Jianfeng Gao: A PhD, yeah.

Host: Did you do that in Shanghai or Beijing?

Jianfeng Gao: Shanghai.

Host: So 1999, you came to Microsoft Research.

Jianfeng Gao: Yeah, in China.

Host: Okay, and then you came over here, or…

Jianfeng Gao: Then in 2005 I moved to Redmond and joined a product group at that time. My mission at that time was to build the first natural user interface for Microsoft Windows Vista. And we couldn’t make it! And after one year, I joined the Microsoft Research here…

Host: All right!

Jianfeng Gao: …as there is a lot more fundamental work to do before we can build a real system for users.

Host: “Let’s go upstream a little…” Okay.

Jianfeng Gao: Then I worked for eight years at Microsoft Research in the EPIC group.

Host: And now you’re Partner Research Manager for the Deep Learning Group…

Jianfeng Gao: Yeah. Yeah, yeah, yeah…

Host: What’s one interesting thing that people don’t know about you? Maybe it’s a personal trait or a hobby or side quest, that may have influenced your career as a researcher?

Jianfeng Gao: I remember, when I interviewed for Microsoft Research, during the interview, I failed almost all the questions and finally I said okay, it’s hopeless. I went home, and the next day I got a phone call saying you’re hired. In retrospect, I think I did not give the right answer, I asked the right questions during the interview. I think it is very important for researchers to learn how to ask the right questions!

Host: That’s funny. How do you get a wrong answer in an interview?

Jianfeng Gao: Because I was asked all the questions about the speech and natural language. I had no idea at all. I remember, at that time, he asked me to figure out an algorithm called Viterbi. I never heard of that. Then I actually asked a lot of questions. And he answered part of them. Then later he said, I cannot answer more questions because if I answer this question, you will get the answer. That shows I asked the right questions!

Host: Let’s close with some thoughts on the potential ahead. And here’s your chance to talk to would-be researchers out there who will take the AI baton and run with it for the next couple of decades. What advice or direction would you give to your future colleagues, or even your future successors?

Jianfeng Gao: I think, first of all, you need to be passionate about research. It’s critical to identify the problem you really want to devote your lifetime to work on. That’s number one. Number two: after you identify this problem you want to work on, stay focused. Number three: keep your eyes open. That’s my advice.

Host: Is that how you did yours?

Jianfeng Gao: I think so!

Host: Jianfeng Gao, thank you for joining us today!

Jianfeng Gao: Thanks for having me.

(music plays) 

To learn more about Dr. Jianfeng Gao and how researchers are going deeper on deep learning, visit Microsoft.com/research