Reinforcement learning for the real world with Dr. John Langford and Rafah Hosn



Episode 75, May 8, 2019

Dr. John Langford, a partner researcher in the Machine Learning group at Microsoft Research New York City, is a reinforcement learning expert who is working, in his own words, to solve machine learning. Rafah Hosn, also of MSR New York, is a principal program manager who’s working to take that work to the world. If that sounds like big thinking in the Big Apple, well, New York City has always been a “go big, or go home” kind of town, and MSR NYC is a “go big, or go home” kind of lab.

Today, Dr. Langford explains why online reinforcement learning is critical to solving machine learning and how moving from the current foundation of a Markov decision process toward a contextual bandit future might be part of the solution. Rafah Hosn talks about why it’s important, from a business perspective, to move RL agents out of simulated environments and into the open world, and gives us an under-the-hood look at the product side of MSR’s “research, incubate, transfer” process, focusing on real world reinforcement learning which, at Microsoft, is now called Azure Cognitive Services Personalizer.



Host: Welcome to another two-chair, two-mic episode of the Microsoft Research Podcast. Today we bring you the perspectives of two guests on the topic of reinforcement learning for online applications. Since most research wants to be a product when it grows up, we’ve brought in a brilliant researcher/program manager duo to illuminate the classic “research, incubate, transfer” process in the context of real-world reinforcement learning.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: Dr. John Langford, a partner researcher in the Machine Learning group at Microsoft Research New York City, is a reinforcement learning expert who is working, in his own words, to solve machine learning. Rafah Hosn, also of MSR New York, is a principal program manager who’s working to take that work to the world. If that sounds like big thinking in the Big Apple, well, New York City has always been a “go big, or go home” kind of town, and MSR NYC is a “go big, or go home” kind of lab.

Today, Dr. Langford explains why online reinforcement learning is critical to solving machine learning and how moving from the current foundation of a Markov decision process toward a contextual bandit future might be part of the solution. Rafah Hosn talks about why it’s important, from a business perspective, to move RL agents out of simulated environments and into the open world, and gives us an under-the-hood look at the product side of MSR’s “research, incubate, transfer” process, focusing on real world reinforcement learning which, at Microsoft, is now called Azure Cognitive Services Personalizer. That and much more on this episode of the Microsoft Research Podcast.

Host: I’ve got two guests in the booth today, both working on some big research problems in the Big Apple. John Langford is a partner researcher in the Machine Learning group at MSR NYC, and Rafah Hosn, also at the New York lab, is the principal program manager for personalization service, also known as real-world reinforcement learning. John and Rafah, welcome to the podcast.

John Langford: Thank you.

Rafah Hosn: Thank you.

Host: Microsoft Research’s New York lab is relatively small, in the constellation of MSR labs, but there’s some really important work going on there. So, to get us started, tell us what each of you does for a living and how you work together. What gets you up in the morning? Rafah, why don’t you start?

Rafah Hosn: Okay, I’ll start. So I wake up every day and think about all the great things that the reinforcement learning researchers are doing and first I map what they’re working on, something that could be useful for customers, and then I think to myself, how can we now take this great research, which typically comes in the form of a paper, to a prototype, to an incubation, to something that Microsoft can make money out of?

Host: That’s a big thread, starting with a little seed, and ending up with a big plant at the end.

Rafah Hosn: Yes, we have to think big.

Host: That’s right. How about you, John?

John Langford: I want to solve machine learning! And that’s ambitious, but one of the things that you really need to do if you want to solve machine learning is you need to solve reinforcement learning, which is kind of the common basis for learning algorithms to learn from interaction with the real world. And so, figuring out new ways to do this, or trying to expand the scope of where we can actually apply these techniques, is what really drives me.

Host: Can you go a little deeper into “solve machine learning?” What would solving machine learning look like?

John Langford: It would look like anything that you can pose as a machine learning problem, you can solve, right? So, I became interested in machine learning back when I was an undergrad, actually.

Host: Yeah.

John Langford: I went to a machine learning class and I was like, oh, this is what I want to do for my life! And I’ve been pursuing it ever since.

Host: And here you are.

John Langford: Yeah.

Host: So, we’re going to spend the bulk of our time today talking about the specific work you’re doing in reinforcement learning. But John, before we get into it, give us a little context as a level set. From your perspective, what’s unique about reinforcement learning within the machine learning universe, and why is it an important part of MSR’s research portfolio?

John Langford: So, most of the machine learning that’s actually deployed is of the supervised learning variety. And supervised learning is fundamentally about taking expertise from people and making that into some sort of learned function that you can then use to do some task. Reinforcement learning is different because it’s about taking information from the world and learning a policy for interacting with the world so that you perform better in one way or another. So, that different source of information can be incredibly powerful, because you can imagine a future where, every time you type on the keyboard, the keyboard learns to understand you better, right? Or every time you interact with some website, it understands better what your preferences are, so the world just starts working better and better in interacting with people.

Host: And so, reinforcement learning, as a method within the machine learning world, is different from other methods because you deploy it in less-known circumstances, or how would you define that?

John Langford: So, it’s different in many ways, but the key difference is the information source. The consequence of that is that reinforcement learning can be surprising. It can actually surprise you. It can find solutions you might have not thought of to solve problems that you posed to it. That’s one of the key things. Another thing is, it requires substantially more skill to apply than supervised learning. Supervised learning is pretty straightforward as far as the statistics go, while reinforcement learning, there’s some real traps out there, and you want to think carefully about what you’re doing. Let me go into a little more detail there.

Host: Please do.

John Langford: Let’s suppose you need to make a sequence of ten steps, and you want to maximize the rewards you get in those ten steps, right? So, it might be the case that going left gives you a small reward immediately, and then you get no more rewards. While if you go right, you get no reward, and then you go left, and then right, and then right, and then left, and then right, so on, ten times… do it just the right way, you get a big reward, right? So many reinforcement learning algorithms just learn to go left, because that gave the small reward immediately. And that gap is not like a little gap. It’s like, you may require exponentially many more samples to learn unless you actually gather the information in an intelligent, conscious way.
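
Editor's sketch of the ten-step example above, under stated assumptions: the reward values (0.1 and 10.0) and the last four moves of the winning sequence are invented for illustration, and the names `run_episode`, `always_left`, and `episodes_until_big_reward` are hypothetical. It shows why a learner that explores by flipping coins has only a 1-in-1024 chance per episode of ever seeing the big reward, while "always go left" pays off immediately.

```python
import random

HORIZON = 10
# The first six moves follow the spoken example (right, left, right, right,
# left, right); the last four are invented here to fill out ten steps.
SECRET = ["R", "L", "R", "R", "L", "R", "L", "L", "R", "R"]
SMALL_REWARD = 0.1   # assumed value for "a small reward"
BIG_REWARD = 10.0    # assumed value for "a big reward"


def run_episode(policy):
    """Play one episode. `policy(step)` returns "L" or "R"."""
    for step in range(HORIZON):
        action = policy(step)
        if step == 0 and action == "L":
            return SMALL_REWARD      # small payoff now, nothing afterwards
        if action != SECRET[step]:
            return 0.0               # wrong turn: the big reward is lost
    return BIG_REWARD                # all ten moves were exactly right


def always_left(step):
    """The myopic policy that grabs the immediate small reward."""
    return "L"


def episodes_until_big_reward(seed, max_episodes=1_000_000):
    """Count episodes of coin-flip exploration until the big reward appears."""
    rng = random.Random(seed)
    for episode in range(1, max_episodes + 1):
        if run_episode(lambda step: rng.choice("LR")) == BIG_REWARD:
            return episode
    return None


# Chance that one episode of uniform random exploration finds the sequence.
HIT_PROBABILITY = 0.5 ** HORIZON     # 1/1024 for ten steps
```

Each extra step halves `HIT_PROBABILITY`, which is one way to read the "exponentially many more samples" remark: undirected exploration needs on the order of 2 to the power of the horizon episodes before the big reward is even observed once.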

Host: Yeah. I’m grinning, and no one can see it, because I’m thinking, that’s how people operate generally, you know? If I…

Rafah Hosn: Actually, yeah. I mean, the way I explain reinforcement learning is the way you teach a puppy how to do a trick. And the puppy may surprise you and do something else, but the reward that John speaks of is the treat that you give the puppy when the puppy does what you are trying to teach it to do, and sometimes they just surprise you and do something different. And actually, reinforcement learning has a very great affinity to Pavlovian psychology.

Host: Well, back to your example, John, you’re saying if you turn left you get the reward immediately.

John Langford: Yeah, a small reward immediately.

Host: A small reward. So, the agent would have to go through many, many steps of this to figure out, don’t go left, because you’ll get more later.

John Langford: You’ll get more later if you go right and you take the right actions after you go right.

Rafah Hosn: Now, imagine explaining this to a customer.

Host: And we will get there, and I’ll have you explain it. Rafah, let’s talk for a second about the personalization service, which is an instantiation of what you call real-world reinforcement learning, yeah?

Rafah Hosn: That’s right.

Host: So, you characterize it as a general framework for reinforcement learning algorithms that are suitable for real-world applications. Unpack that a bit. Give us a short primer on real-world reinforcement learning and why it’s an important direction for reinforcement learning in general.

Rafah Hosn: Yeah, I’ll give you my version, and I’m sure John will chime in. But, you know, much of the reinforcement learning that people hear about is almost always done in a simulated environment, where you can be creative as to what you simulate, and you can generate, you know, gazillions of samples to make your agents work. Our type of reinforcement… John’s type of reinforcement learning is something that we deploy online, and what drives us, John and I, is to create or use this methodology to solve real-world problems. And our goal is really to advance the science in order to help enterprises maximize their business objective through the usage of real-world reinforcement learning. So, when I say real world, these are models that we deploy in production with real users, getting real feedback, and they learn on the job.

Host: Well, John, talk a little bit about what Rafah has alluded to. There’s an online, real-world element to it, but prior to this, reinforcement learning has had some big investments in the gaming space. Tell us the difference and what happens when you move from a very closed environment to a very open environment from a technical perspective.

John Langford: Yeah, so I guess the first thing to understand is why you’d want to do this, because if reinforcement learning in simulators works great, then why do you need to do something else? And I guess the answer is, there are many things you just can’t simulate. So, an example that I often give in talks is, would I be interested in a news article about Ukraine? The answer is, yes. Because my wife is from Ukraine. But you would never know this. Your simulator would never know this. There would be no way for the policy to actually learn that if you’re learning in the simulator.

Host: Right.

John Langford: So, there are many problems where there are no good simulators. And in those situations, you don’t have a choice. So, given that you don’t have a choice, you need to embrace the difficulties of the problem. So, what are the difficulties of the real-world reinforcement learning problems? Well, you don’t have zillions of examples, which are typically required for many of the existing deep reinforcement learning algorithms. You need to be careful about how you use your samples. You need to use them to maximum and utmost efficiency in trying to do the learning. Another element that happens is often, when people have simulators, those simulators are kind of effectively stationary. They stay the same throughout the process of training. But in real-world problems, many of them that we encounter, we run into all kinds of non-stationarities, these exogenous events, so the algorithms need to be very robust. So the combination of using samples very efficiently and great robustness in these algorithms are kind of key offsetting elements from what you might see in other places.

Host: Which is challenging AlphaGo or Ms. Pac-Man or the other games that have been sort of flags waved about our progress in reinforcement learning?

John Langford: I think those are fun applications. I really enjoy reading about them and learning about them. I think it’s a great demonstration of where the field has gotten, but I feel like there’s this issue of AI winter, right? So, there was once a time when AI crashed. That may happen again, because AI is now a buzzword. But I think it’s important that we actually do things that have some real value in the world which actually affect people’s lives, because that’s what creates a lasting wave of innovation and puts civilization into a new place.

Host: Right.

John Langford: So that’s what I’m really seeking.

Host: What season are we in now? I’ve heard there has been more than one AI winter, and some people are saying it’s AI spring. I don’t know. Where do you see us in terms of that progress?

John Langford: I think it’s fair to say that there’s a lot of froth in terms of people claiming things that are not going to come to pass. At the same time, there is real value being created. Suddenly we can do things and things work better through some of these techniques, right? So, it’s kind of this mishmash of overpromised things that are going to fail, and there are things that are not overpromised, and they will succeed, and so if there’s enough of those that succeed, then maybe we don’t have a winter. Maybe it just becomes a long summer.

Host: Like San Diego all the time….

Rafah Hosn: Yeah, but I think, to comment on John’s point here, I think reinforcement learning is a nascent technique compared to supervised learning. And what’s important is to do the crawl, walk, run, right? So, yeah, it’s sexy now and people are talking about it, but we need to rein it in from a business perspective as to, you know, what are the classes of problems that we can satisfy the business leader with? And satisfy them effectively, right? And I think, from a reinforcement learning perspective, John, correct me, we are very much at the crawl phase in solving generic business problems.

John Langford: I mean, we have solved some generic business problems. But we don’t have widely deployed, or deployable, platforms for reusing those solutions over and over again. And it’s so easy to imagine many more applications than people have even tried. So, we’re nowhere near a mature phase in terms of even simple kinds of reinforcement learning. We are ramping up in our ability to solve real-world reinforcement learning problems…

Host: Heading towards…

John Langford: …and there’s a huge ramp still to happen.

Host: Heading towards your goal of solving machine learning?

John Langford: Yes.

Rafah Hosn: But I mean to be fair though, we can actually satisfy some classes of problems really well with nascent technology.

Host: Yes.

Rafah Hosn: So yes, we are nascent, and the world is out there for us to conquer, but I think we do have techniques that can solve a whole swath of problems and it’s up to us to harvest that.

(music plays)

Host: Well, let’s continue the thread a little bit on the research areas of reinforcement learning. And there’re several that seem to be gaining traction. Let’s go sort of high level and talk about this one area that you’re saying is basically creating a new foundation for reinforcement learning. What’s wrong with the current foundation, what do we need the new foundation for, and what are you doing?

John Langford: The current foundation of reinforcement learning is called a Markov decision process. The idea in a Markov decision process is that you have states and actions and, given a state, you take an action, then you have some distribution over the next state. So that’s kind of what the foundation is, is to help everybody describe your solutions. And the core issue with this is that there are no good solutions when you have a large number of states. All solutions kind of scale with the number of states, and so, if you have a small number of possible observations about the world, then you can employ these theoretically motivated reinforcement learning algorithms, which are provably efficient, and they will work well. But in the real world, you have a megapixel camera which has two to the one million or sixteen to the one million possible inputs. And so, you never encounter the same thing twice, and so you just can’t even apply these algorithms. It doesn’t even make sense. It’s ridiculous. So, when I was a young graduate student, I was, of course, learning about Markov decision processes and trying to figure out how to solve reinforcement learning better with them. And then at some point, after we had a breakthrough, I realized that the breakthrough was meaningless, because it was all about these Markov decision processes. And no matter what, it just never was going to get to the point where you could actually do something useful. So around 2007, I decided to start working on contextual bandits. This is an expansion of what reinforcement learning means, in one sense, but a restriction in another sense. So instead of caring about the reward of a long sequence of actions, we’re going to care about the reward of the next action. Right? So that’s a big simplification. 
On the other hand, instead of caring about the state, we’re going to care about an observation and we’re going to demand that our algorithms don’t depend on the number of possible observations, just like they do in supervised learning. So, we studied this for several years. We discovered how to create statistically efficient algorithms for these kinds of problems. So that’s kind of the foundation of the systems that we’ve been working on. And then, more recently, after cracking these contextual bandit problems, we wanted to address a larger piece of reinforcement learning. So now we’re thinking about contextual decision processes where you have a sequence of rounds, and on each round, you see some observation, you choose some action, and then you do that again and again and again. And then at the end of an episode, maybe ten steps, maybe a hundred, you get a reward. Right? So, there’s some long, delayed reward dependent upon all the actions you’ve taken and all the observations you’ve made. And now it turns out that when these observations are generated by some small, underlying state space which you do not know in advance and which is never told to you, you can still learn. You can still do reinforcement learning. You can efficiently discover what a good policy is, globally. So, the new foundation of reinforcement learning is about creating a foundation for reinforcement learning algorithms that can cope with a megapixel camera as an observation rather than having like ten discrete or a hundred discrete states.
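
Editor's sketch of the contextual bandit setting described above, assuming a deliberately simple stand-in learner: epsilon-greedy exploration over one linear reward model per action. This is not presented as the algorithm the MSR team uses; the class `EpsilonGreedyBandit`, the toy world in `simulate`, and all parameter values are hypothetical. The point it illustrates is that the learner's size depends on the number of features, not on how many distinct observations exist, and that only the chosen action's reward is ever seen.

```python
import random


class EpsilonGreedyBandit:
    """One linear reward model per action, epsilon-greedy exploration."""

    def __init__(self, n_actions, n_features, epsilon=0.1, lr=0.1, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)
        self.weights = [[0.0] * n_features for _ in range(n_actions)]

    def predict(self, x):
        """Predicted reward of each action for feature vector x."""
        return [sum(w * f for w, f in zip(ws, x)) for ws in self.weights]

    def choose(self, x):
        """Return (action, probability that this action was chosen)."""
        scores = self.predict(x)
        best = max(range(self.n_actions), key=scores.__getitem__)
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(self.n_actions)
        else:
            action = best
        prob = self.epsilon / self.n_actions
        if action == best:
            prob += 1.0 - self.epsilon
        return action, prob

    def learn(self, x, action, reward):
        """One gradient step on squared error, for the chosen action only."""
        error = reward - self.predict(x)[action]
        ws = self.weights[action]
        for i, f in enumerate(x):
            ws[i] += self.lr * error * f


def simulate(rounds=5000, seed=1):
    """Toy world: action 0 pays off when u > 0, action 1 otherwise."""
    world = random.Random(seed)
    bandit = EpsilonGreedyBandit(n_actions=2, n_features=2)
    total = 0.0
    for _ in range(rounds):
        u = world.uniform(-1.0, 1.0)      # a fresh observation every round
        x = [1.0, u]                      # bias term plus the observation
        action, _prob = bandit.choose(x)
        reward = 1.0 if (action == 0) == (u > 0) else 0.0
        bandit.learn(x, action, reward)   # only this action's reward is seen
        total += reward
    return total / rounds
```

Calling `simulate()` returns the average reward per round in this toy world. The observation `u` is continuous, so the learner essentially never sees the same input twice, which is the situation a state-counting method cannot handle.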

Host: And you’re getting some good traction with this approach?

John Langford: Yeah. I mean, contextual bandits are deployed in the real world and being used in many places at this point.

Host: Okay.

John Langford: There’s every reason to believe that if we can crack contextual decision processes, which is our current agenda, that will be of great use as well.

Host: Rafah, at its core, reinforcement learning systems are designed to be self-improving systems, and kind of learn from the real world like humans do.

Rafah Hosn: Yes.

Host: Or puppies.

Rafah Hosn: Or puppies.

Host: And the real world is uncertain and risky.

Rafah Hosn: Yes.

Host: So how do you, from your perspective or from your angle, build trust with the customers that you interact with, both third-party and first-party customers who are giving you access to their own real-life traffic online?

Rafah Hosn: Yeah, this is an important topic when we start looking at how we do incubations in our team. And we have a specific challenge, as you were saying, because if we were a supervised learning model, we would go to a customer and say, hey, you know, give me a data set, I’ll run my algorithm, if it improves, you deploy it, we deploy it in an A/B test, and if we are good, you’re good to go. Our system is deployed in production, so here we are with customers and talking to them about advanced machine learning techniques from research, and we want to deploy them in their online production system. So, as you can imagine, it becomes an interesting conversation. So, the way we approach this actually is by taking ideas from product teams. So, when we went and did our incubations, we did it with a hardened prototype, meaning this is a prototype that’s not your typical stitched-up Python code that, you know, is hacky. We took a fair amount of time to harden it to the degree that if you run it in production, it’s not going to crash your customer’s online production system. So that’s number one. And then when we approach customers, our system learns from the real world, and you do need a certain amount of traffic because our models are like newborn puppies. They don’t know any tricks. So, you need to give them information in order to learn. But what we typically do is we have a conversation with our customer and say, hey, you know, yes, this is research, but it is a hardened prototype. That’s number one. And two, we use previous incubations as reference to newer ones. We borrow ideas from how products go sell their prototypes, right? And then we, as a methodology, say to customers, when they have large volumes of traffic, to give us a portion of their traffic which is good enough for us to learn and prove the ROI, but small enough for them to de-risk. And that methodology has worked very well for us.

Host: De-risk is such a good word. Let’s go a little further on that thread. Talk a little bit about the cold start versus the warm start when you’re deploying.

Rafah Hosn: Yes, so that’s another interesting conversation with our customers, especially those that are used to supervised learning where you train your model, right, with a lot of data, and you deploy it, and it’s already learned something. Our models and our personalization service start really cold, but the way John and the teams created those algorithms allows us to learn very fast. And the more traffic you give it, the faster it learns. So, I’ll give you an example. We deployed a pilot with Xbox Top of Home where we were personalizing two of the three slots or four slots that they have on the Top of Home. And Xbox gets, you know, millions of events per day. So, with only 6 million events per day, which is a fraction of Xbox traffic, in about a couple of hours, we went from cold to very warm. So again, from a de-risking perspective, in these conversations with our customers, first or third parties, we tend to say, yes, it’s cold start. But these algorithms learn super-fast, and there’s a certain amount of traffic flow that enables that efficient learning. So, we haven’t had major problems. We start by making our customers understand how the system works, and we go from there.

Host: Are there instances where you’re coming into a warm start where there’s some existing data or infrastructure?

John Langford: Yeah, so that definitely happens. It’s typically more trouble than it’s worth to actually use pre-existing data, because, when you’re training in a contextual bandit, you really need to capture four things: the features, the action, the reward for the action, and then the probability of taking the action. And almost always, the probability is not recorded in any kind of reliable way if it was even randomized previously. So, given that you lack one of those things, you can… there are ways to try to repair that. They kind of work, but they’re kind of a pain, and not the kind of thing that you can do in an automatic fashion. So typically, we want to start with recording our own data so we can be sure that it is, in fact, good data. Now, with that said, there are many techniques for taking into account pre-existing models, right? So, we actually have a paper now on arXiv talking about how to combine an existing supervised data source with a contextual bandit data source. Another approach, which is commonly very helpful, is people may have an existing supervised system which may be very complex, and they may have built up a lot of features around that…
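
Editor's sketch of the four things named above (features, action, reward, probability) as a log record. The class name `LoggedEvent`, its field names, and the example values are assumptions made for illustration, not a format used by any Microsoft system. The validation step reflects the point in the conversation: a record whose probability is missing or zero cannot be used later.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LoggedEvent:
    features: dict      # what was observed when the decision was made
    action: int         # which action was taken
    reward: float       # the reward observed for that action
    probability: float  # probability with which that action was chosen

    def __post_init__(self):
        # A logged action must have had a nonzero chance of being chosen.
        if not 0.0 < self.probability <= 1.0:
            raise ValueError("probability must be in (0, 1]")

    def to_json(self):
        """Serialize the record as one JSON line for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example record.
event = LoggedEvent(
    features={"hour": 9, "device": "mobile"},
    action=2,
    reward=1.0,
    probability=0.25,
)
line = event.to_json()
```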

Host: Right.

John Langford: …which may not even be appropriate. Often, there’s a process around any kind of real system where the learning algorithm and the features are kind of co-evolving, and so moving away from either of them causes a degradation in performance.

Host: Sure.

John Langford: So, in that kind of situation, what you want to do is, you want to tap the existing supervised models to extract features which are very powerful. And then, given those very powerful features, you can very quickly get to a good solution. And then so the exact mechanism of that extraction is going to depend upon the representation that you’re using. With a neural network, you kind of rip off the top layer and use that. With a decision tree or a boosted decision tree or a decision forest, you can use the leaf membership as a feature that you can then feed in for a very fast warm-up of a contextual bandit learner.
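
Editor's sketch of the leaf-membership idea for tree models, using a tiny hand-built forest rather than any real library. The nested-dict tree format and the names `leaf_index` and `leaf_features` are hypothetical. Each tree maps an input to the leaf it lands in, and the one-hot leaf indicators become the features handed to a downstream learner.

```python
def leaf_index(tree, x):
    """Walk a tiny hand-built tree and return the id of the leaf x lands in."""
    node = tree
    while "leaf" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def leaf_features(trees, x, leaves_per_tree):
    """Concatenate one-hot leaf indicators from every tree in the forest."""
    feats = []
    for tree in trees:
        one_hot = [0.0] * leaves_per_tree
        one_hot[leaf_index(tree, x)] = 1.0
        feats.extend(one_hot)
    return feats


# A made-up "existing model": two decision stumps with two leaves each.
forest = [
    {"feature": 0, "threshold": 0.5, "left": {"leaf": 0}, "right": {"leaf": 1}},
    {"feature": 1, "threshold": 2.0, "left": {"leaf": 0}, "right": {"leaf": 1}},
]

print(leaf_features(forest, [0.9, 1.0], leaves_per_tree=2))
# prints [0.0, 1.0, 1.0, 0.0]
```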

Host: John, talk about offline experimentations. What’s going on there?

John Langford: Yeah, so this is one of the really cool things that’s possible when you’re doing shallow kinds of reinforcement learning, reinforcement learning with maybe one step or maybe two steps. So, if you record that quad of features, action, reward and the probability, then it becomes possible to evaluate any policy that chooses amongst the set of available actions. Okay? So, what that means is that, if you record this data, and then later you discover that maybe a different learning rate was helpful, or maybe you should be taking this feature and that feature and combining them to make a new feature. You can test to see exactly how that would have performed if you had deployed that policy at the time you were collecting data. So, this is amazing, because this means that you no longer need to use an A/B test for the purpose of optimization. You still have reasons to use it for purposes of safety, but for optimization, you can do that offline in a minute rather than doing it online for two weeks waiting to get the data necessary to actually learn.
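
Editor's sketch of offline policy evaluation from logged (features, action, reward, probability) records, assuming the standard inverse propensity scoring estimator. The conversation does not name the estimator the team uses, so this is a common textbook choice rather than a description of their system. The function `ips_value`, the tiny log, and the candidate policies are all made up for illustration.

```python
def ips_value(logged, policy):
    """Estimate the average reward `policy` would have earned on the log.

    Each record is (features, action, reward, probability). A record counts
    only when the policy agrees with the logged action, and its reward is
    divided by the probability with which that action was originally taken.
    """
    total = 0.0
    for features, action, reward, probability in logged:
        if policy(features) == action:
            total += reward / probability
    return total / len(logged)


# Made-up log: two actions chosen uniformly at random (probability 0.5).
log = [
    ({"u": 1}, 0, 1.0, 0.5),
    ({"u": 1}, 1, 0.0, 0.5),
    ({"u": -1}, 0, 0.0, 0.5),
    ({"u": -1}, 1, 1.0, 0.5),
]

# Candidate policies can be scored against the same log in one pass each.
candidates = {
    "match_sign": lambda f: 0 if f["u"] > 0 else 1,
    "always_0": lambda f: 0,
    "always_1": lambda f: 1,
}
scores = {name: ips_value(log, policy) for name, policy in candidates.items()}
print(scores)
# prints {'match_sign': 1.0, 'always_0': 0.5, 'always_1': 0.5}
```

Scoring another candidate is just another pass over the same log, which is why many alternatives can be compared offline without a separate A/B test for each.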

Rafah Hosn: Yeah, just to pick up on, why is this a gold nugget? Data scientists spend a fair amount of time today designing models a priori and testing them in A/B tests only to learn two weeks after that they failed, and they go back to ground zero. So here you’re running hundreds, if not thousands, of A/B tests on the spot. And when we talk about this to data scientists and enterprises, their eyes light up. I mean, that is one of the key features of our system that just brightens the day for many data scientists. It’s a real pain for them to design models, run them in A/B… it’s very costly as well.

Host: Mm-hmm.

Rafah Hosn: So, talk about productivity gains. It’s immense when you can run a hundred to two hundred A/B tests in a minute versus running one A/B test for two weeks.

Host: Rafah, you work as a program manager within a research organization.

Rafah Hosn: Yes.

Host: And it’s your job to bring science to the people.

Rafah Hosn: Yes.

Host: Talk about your process, a little more in detail, of “research, incubate, transfer” in the context of how you develop RL prototypes and engineer them and how you test them, and specifically maybe you could explain a couple of examples of this process of deployments that are out there already. How are you living up to the code?

Rafah Hosn: We have a decent-size engineering team that supports our RL efforts in MSR. And our job is twofold. One is to, from a program management perspective, it’s to really drive what it means to go from an algorithm to a prototype, and then validate whether that prototype has any market potential. I take it upon me, as a program manager, when researchers are creating these wonderful academic papers with great algorithms, and some of them may have huge market potential. So, this market analysis happens actually in MSR. And we ask ourselves, great algorithm; what are the classes of problems we can solve for it? And would people like relate to these problems such that we could actually go and incubate them? And the incubation is a validation of this market hypothesis. So that’s what we do in our incubations. We are actually trying to see whether this is something that we could potentially tech transfer to the product team. And we’ve done this with contextual bandits in the context of personalization scenarios. So contextual bandits is a technique, right? And so, we ask ourselves, okay, with this technique, what classes of problems can we solve very efficiently? And personalization was one of them. And we went and incubated it first with MSN. Actually, John and the team incubated it with MSN first, and they got a twenty-six percent lift. That’s multi-million-dollar revenue potential. So, from a market potential, it really made sense. So, we went and said, okay, one customer is not statistically significant, so we need to do more. And we spent a fair amount of time actually validating this idea and validating the different types of personalization. So, MSN was a news article personalization. Recently, we did a page layout personalization with Japan where they had four boxes on Japan, and they were wondering how to present these boxes based on the user that was visiting that page. And guess what? 
We gave them two thousand five hundred events, so it was a short-run pilot that we did with them. We gave them an eighty percent lift. EIGHTY. They were flabbergasted. They couldn’t believe – and this was run on an A/B test. So they had their page layout that their designers had specified for them, for all users, running as the control, and they had our personalization engine running with our contextual bandit algorithm, and they ran it, and for us, you know, twenty-five hundred samples is not really a lot. But even with that, we gave them an eighty percent lift over their control. So, these are the kinds of incubation that, when we go to our sister product team in Redmond and tell the story, they get super excited that this could be a class of applications that could work for the masses.

(music plays)

Host: John, there’s a lot of talk today about diversity, and that often means having different people on the team, but there’s other aspects, especially in reinforcement learning that include diversity of perspective and approach. How do you address this in the work you’re doing and how do you practically manage it?

John Langford: One thing to understand is that research is an extreme sport in many ways. You’re trying to do something which nobody has ever done before. And so, you need an environment that supports you in doing this in many ways. It’s hard for a single researcher to have all the abilities that are needed to succeed. When you’re learning to do research, you’re typically learning a very narrow thing. And over time, maybe that gets a little bit broader, but it’s still going to be the case that you just know a very narrow perspective on how to solve a problem. So, one of the things that we actually do is, on a weekly basis, we have an open problems discussion where a group of researchers gets together, and one of them talks about the problem that they’re interested in. And then other people can chime in and say, oh, maybe we should look at it this way or think about it that way. That helps, I think, sharpen the problems. And then, in the process of solving problems, amazing things come up in discussion, but they can only come up if you can listen to each other. I guess the people that I prefer to work with are the ones who listen carefully. There’s a process of bouncing ideas off each other, discovering the flaws in them, figuring out how to get around the flaws. This process can go on. It’s indefinite.

Host: Yeah.

John Langford: But sometimes it lands. And when it lands, that moment when you discover something, that’s really something!

Host: Rafah, do you have anything to add to that?

Rafah Hosn: So, when I think about diversity in our lab, I think that, to complement what John’s saying, I like to always also think about the diversity of disciplines. So, in our lab, we’re not a big lab, but we have researchers, we have engineers, we have designers, and we have program managers. And I think these skillsets are diverse, and yet they complement each other so well. And I think that adds to the richness of what we have in our lab.

Host: In the context of the work you do, its applications and implications in the real world, is there anything that keeps you up at night? Any concerns you’re working to mitigate even as you work to innovate?

John Langford: I think the answer is, yes. Uhhh… anybody who understands the potential of machine learning, AI, whatever you want to call it, understands that there are negative ways to use it…

Rafah Hosn: Right.

John Langford: …right? It is a tool and…

Host: Yeah.

John Langford: …we need to try to use the tool responsibly, and we need to mitigate the downsides where we can see them in advance. So, I do wonder about this. But recently we had a paper on fair machine learning…

Host: Mm hmm.

John Langford: …and we showed that any supervised learning algorithm can, in a black box fashion, be turned into a fair supervised learning algorithm. We demonstrated this both theoretically and experimentally. So that’s a promising paper that addresses a narrow piece of ethics around AI, I guess I would say.

Host: Yeah.

John Langford: As we see more opportunities along these lines, we will solve them.

Rafah Hosn: Yeah, also use these techniques for the social good, right? I mean, as we are trying to use them to monetize, also we should use them for the social good.

Host: How did each of you end up at Microsoft Research in New York City?

John Langford: This is actually quite a story. So, I used to be at Yahoo Research. One day, right about now, seven years ago, the head of Yahoo Research quit. So, we decided to essentially sell the New York lab. So, we created a portfolio of everybody in the New York lab. There were fifteen researchers there. We sent it around to various companies. Microsoft ended up getting thirteen out of fifteen people! And that was the beginning of Microsoft Research New York.

Host: Rafah, how did you come to Microsoft Research New York City?

Rafah Hosn: He told me I was going to revolutionize the world. That’s why I came over from IBM! So, I actually had a wonderful job at IBM, applying Watson Technologies for children’s education. And one day, a Microsoft recruiter called me, and they said, “John Langford, renowned RL researcher, is looking for a program manager. You should interview with him!” And I’m like, okay! So, I interviewed at Microsoft Research New York, spoke to many people, and at the time I, you know, I was comfortable in my job and I had other opportunities but, in his selling pitch to me, John Langford calls me one day at home and he says, “You should choose to come and work for Microsoft Research because we’re going to revolutionize the world.” And I think it sunk in that we can be at the cusp of something really big, and that got me really excited to join, and that’s how I ended up at Microsoft Research.

Host: As we close, I’d like each of you to address a big statement that you’ve made. John, you started out our interview with “I want to solve machine learning.” Rafah, you have said that your ultimate goal is “real-world reinforcement learning for everyone.” What does the world look like if each of you is wildly successful?

John Langford: Yeah, so there’s a lot of things that are easy to imagine being a part of the future world that just aren’t around now. You should imagine that every computer interface learns to adapt to you, rather than you needing to adapt to the user interface. You could imagine lots of companies just working better. You could imagine a digital avatar that, over time, learns to help you book the flights that you want to book or things like that, right? Often there’s a lot of mundane tasks that people do over and over again, and if you have a system that can record and learn from all the interactions that you make with computers or with the internet, it can happen on your behalf. That could really ease the lives of people in many different ways. Lots of things where there’s an immediate sense of, oh, that was the right outcome or, oh, that was the wrong outcome, can be addressed with just the technology that we have already. And then there’s technologies beyond that, like the contextual decision processes that I was talking about that may open up even more possibilities in the future.

Host: Rafah.

Rafah Hosn: To me, what a bright future would look like is when we can cast a lot of the issues that we see today, problems at the enterprise level and at the personal level, as reinforcement learning problems that we can actually solve. And more importantly for me, you know, as we work in technology and we develop all these techniques, the question is, are we making the world a better world, right? And can we actually solve some hard problems like famine and diseases with reinforcement learning? And maybe not now, but can it be the bright future that we look out for? I hope so.

Host: I do too. John Langford, Rafah Hosn, thank you for joining us today.

Rafah Hosn: Thank you.

John Langford: Thank you.

(music plays)

Host: To learn more about Dr. John Langford and Rafah Hosn and the quest to bring reinforcement learning to the real world, visit
