Microsoft Research Podcast

An ongoing series of conversations bringing you right up to the cutting edge of Microsoft Research.

Hybrid Reward Architecture and the Fall of Ms. Pac-Man with Dr. Harm van Seijen

December 6, 2017 | By Microsoft blog editor

Episode 3, December 6, 2017

If you’ve ever watched The King of Kong: A Fistful of Quarters, you know what a big deal it is to beat a video arcade game that was designed not to lose. Most humans can’t even come close. Enter Harm van Seijen and a team of machine learning researchers from Microsoft Research Montreal. They took on Ms. Pac-Man. And won. Today we’ll talk to Harm about his work in reinforcement learning, the inspiration for Hybrid Reward Architecture, visit a few islands of tractability and get an inside look at the science behind the AI defeat of one of the most difficult video arcade games around.

To find out more about Harm van Seijen and the groundbreaking work going on at Microsoft Research Montreal, visit


Podcast Transcript

Harm van Seijen: Rather than trying to mimic how the brain works, it’s more being inspired by how the brain works. This is true for neural networks for example. They are also based on how our brain processes information. It doesn’t mean that it’s an exact copy of how the brain works. That’s not the goal. I mean, machines have different capabilities, so it’s not so much about trying to mimic exactly the human brain rather than being inspired.

Host: You’re listening to the Microsoft Research Podcast. A show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

If you ever watched The King of Kong: A Fistful of Quarters, you know what a big deal it is to beat a video arcade game that was designed not to lose. Enter Dr. Harm van Seijen and a team of machine learning researchers from Microsoft Maluuba in Montreal. They took on Ms. Pac-Man. And won.

Today we’ll talk to Harm about his work in reinforcement learning, the inspiration for Hybrid Reward Architecture, visit a few islands of tractability and get an inside look at the science behind the AI defeat of one of the most difficult video arcade games around.

That and much more on this episode of the Microsoft Research Podcast.

Host: Harm, give our listeners a brief description of the kind of work you do.

Harm van Seijen: I work on reinforcement learning, which is a sub-class of machine learning that focuses on learning good behaviors. So, it’s a very powerful method. You can use it in many different instances. But there’s still a lot of research needed to make sure that it can be applied in its broadest setting. So we are working on those challenges to remove those obstacles so it can be applied in a very broad way.

Host:       Why is machine learning and particularly reinforcement learning such an important field in artificial intelligence?

Harm van Seijen: So, with more and more data, you can just build more and more complex systems. So when your systems become more complex, at a certain moment, you just cannot code everything by hand anymore, but you want it to learn automatically. For example, classifying images: to build a classifier that can recognize certain objects, if you would do that by hand, it would be very complex. So if you do it by machine learning you can automatically learn it. So this helps you to build a very complex classifier that you couldn’t otherwise encode by hand. And so with reinforcement learning, it’s a similar thing, but it’s then about behaviors. In behaviors, it’s about taking actions. So, for example, it’s used in AlphaGo. You are able to build a Go player that is much stronger than humans. Because it learns automatically, you can build something that is much better than humans.

Host:       One of your big achievements recently is beating the game of Ms. Pac-Man.  Why are video games so suitable for exploring artificial intelligence?

Harm van Seijen:       Well, they are suitable because they give you a very controlled environment to test certain ideas. So if you are dealing with applications in the real world, you then have to tackle the full complexity of your problem all at once. Whereas in a game, you can play with how complex you make your problem, and it’s a more controlled environment to test certain ideas. Also because you can run it faster than real time for example, so you can very quickly have a very quick turnaround time for building algorithms.

Host:       So you are running the game at a faster speed than the game normally goes?

Harm van Seijen:       Yeah, exactly. So for example, if we take the example of Go as an example, so if you would play Go in real time, it’s a very slow game and maybe a game lasts an hour or two hours. If you play it in an artificial environment, you can run it much faster, so you can play many, many games in the same amount of time that you would play a single game in real time. So yeah, it can give you a big speed-up in that scenario.

Host:       But Ms. Pac-Man for example, which is a different kind of a game obviously, it’s a video game, that moves super-fast to begin with. Is it the same sort of incremental speed increase relevant there?

Harm van Seijen:       Yeah, I mean, because you always want to go faster. So Ms. Pac-Man we can run much faster than real time. We can run it maybe 30 or 40 times as fast as real time. So it means that your total computation time, if in real time it would take a month, if it’s 30 times as fast, it would take you a single day. So it makes a big difference.

Host:       This is where machines are more capable than we are currently in terms of speed of processing and calculation and things like that.

Harm van Seijen: Yes. It’s interesting because if you look at games, then there are a couple of aspects where machines really have an advantage and there are a couple of aspects where a human has an advantage. So if you have a game that is challenging mainly because it requires a very fast reaction time, then that would be a game where machines naturally have an advantage. Versus, on the other hand, you have games that really require long-term reasoning and those are games that are very suitable for humans but these are much harder for machines.

Host:       I’m wondering if we can make them feel inferior, but I don’t think they have those kinds of feelings at the current time.

Harm van Seijen:       Not yet, no.

Host:       That’s what we’re working for, right?  Listen, as a level set, kind of talk to me a little bit about the broader world of machine learning right now and differentiate some of the concepts.

Harm van Seijen:       Within machine learning you have different sub-problems basically. So the three big ones are supervised learning, unsupervised learning and reinforcement learning. So those are really problem definitions. For example, reinforcement learning tackles the problem of an agent that interacts with an environment. And if you compare that with deep learning for example, so deep learning is really – it’s not so much a problem definition as it is a technique. So it’s a particular technique to do function approximation; in particular having many different layers of neural networks for example. So it’s a technique you can use on these different problem instances. So if you combine deep learning with reinforcement learning, you get something called deep reinforcement learning and it just means that you are tackling the problem of reinforcement learning using a function approximation that uses deep learning.

Host:       What is hot right now?  Or are they all – are there areas in machine learning that are really interesting to a lot of people?

Harm van Seijen: Yes. So deep learning really had a big boom a couple of years ago. So that’s like super hot right now. It had like a long history; in the 80s, it was also popular, but then it kind of died down again. And so the most recent boom was a couple of years ago when they discovered how you could build much deeper networks and much more powerful networks. Deep learning has received a big boom very recently. And I think reinforcement learning is just on the brink of breaking through. In the most recent two years, a lot of companies have become very interested in reinforcement learning as well. So I think that’s the next big thing.

Host:       It seems like as you say it, it makes me think that the booms come, when a particular researcher or a group of researchers or even inventors if you will, make a breakthrough in it and then everyone pays attention. It’s like hey that’s new, that’s interesting, let’s go on that thread. Where are you with reinforcement learning in that process? Are you still on kind of the research breakthrough phase?

Harm van Seijen: So, in terms of maturity I think it’s much less mature. It’s much more still in the research phase than something like deep learning. So there are still a lot of problem instances that we cannot solve yet or not solve well yet. So there are a couple of islands of tractability, certain problem instances within reinforcement learning that we can solve. So in particular, if you consider the bandit problem, which is a special case of a reinforcement learning problem. That is one that we can do very well and it applies for example in ad placement. So placing ads, showing ads on a website, that can be modeled as a bandit problem. So it’s being used there in real products. So there is some subset of reinforcement learning we can already do well and use in real products. But for the most part, it’s still a research effort.
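
The ad placement case mentioned here can be sketched as a small bandit simulation. This is only an illustration under stated assumptions: the click rates are made up, each ad is one "arm", and the epsilon-greedy rule is a standard textbook strategy chosen for brevity, not something named in the conversation. The function name `run_epsilon_greedy` is hypothetical.

```python
import random


def run_epsilon_greedy(click_rates, steps=20000, epsilon=0.1, seed=0):
    """Simulate showing ads with an epsilon-greedy bandit.

    click_rates: hypothetical true click probability of each ad (unknown to the learner).
    Returns how often each ad was shown and the learned click-rate estimates.
    """
    rng = random.Random(seed)
    n = len(click_rates)
    counts = [0] * n          # times each ad was shown
    estimates = [0.0] * n     # running average of observed clicks per ad
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                             # explore: random ad
        else:
            arm = max(range(n), key=lambda i: estimates[i])    # exploit: best estimate so far
        reward = 1.0 if rng.random() < click_rates[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update: new_mean = old_mean + (reward - old_mean) / count
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


# Three hypothetical ads with made-up click rates.
counts, estimates = run_epsilon_greedy([0.1, 0.3, 0.6])
best_ad = max(range(len(estimates)), key=lambda i: estimates[i])
```

Unlike the full reinforcement learning problem, the choice of ad here does not change what the learner sees next, which is part of why this special case is considered tractable.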

Host:       Right. Not infancy necessarily, but certainly not mature.

Harm van Seijen:       Yes, absolutely.

Host:       Did you say islands of tractability?

Harm van Seijen:       Yes. So just certain problem instances that we have a good handle on.

Host:       Okay, good. I think I’ve been stuck on an island of intractability before, but…

Host: When we talked before, you said your work encompasses computer science and neuroscience. In essence you are drawing on theories of how the human brain works. How are you applying these theories to your work with machines?

Harm van Seijen: I would more see it as rather than trying to mimic how the brain works, it’s more being inspired by how the brain works. This is true for neural networks for example. They are also based on how our brain processes information. It doesn’t mean that it’s an exact copy of how the brain works. That’s not the goal. I mean, machines have different capabilities so it’s not so much about trying to mimic exactly the human brain rather than being inspired. That also holds for certain algorithms in reinforcement learning. It’s more being inspired by how we think decision making in humans works than trying to make an exact copy of that.

Host:       That’s interesting.  Speaking of inspiration, how did you come up with the idea of Hybrid Reward Architecture? What was the inspiration behind that?

Harm van Seijen:       So the inspiration really came from how humans cooperate to build great products. For example, if you have a smart phone, it’s a great piece of technology. Many, many people were involved in building it. And there is not really a single person that knows how to make a smart phone. But it’s really the group of persons, all of them that each have their own expertise that know how to make a smart phone. So we wanted to build something similar where if you are trying to solve a very complex task, to have a bunch of different artificial agents that work together and each agent is focused on a different aspect of the task. So each agent has a different expertise. And then by combining those agents in a particular way, they show an overall behavior that is very intelligent.

Host:       So it’s kind of the distributed expertise model of business as it were.

Harm van Seijen:       Right.

Host:       But only with artificial intelligence agents within a program?

Harm van Seijen:       Yes, that’s right.

Host: So let’s talk about that for a second. Because when we were talking about HRA before, you mentioned the importance of this. You’ve got these agents acting individually. You program them to do specific tasks. But then there’s a necessity, like there is in a business, for a boss to make a decision. So explain how that works within this Hybrid Reward Architecture (HRA), particularly with the Ms. Pac-Man task.

Harm van Seijen: So we want all those agents to only care about their specific problem and not worry about how to collaborate with other agents. So we have a hierarchical structure where at the bottom you have all those little agents that take care of their little problem that they need to solve. So in the case of reinforcement learning, in the case of Ms. Pac-Man, the problem that you ultimately want to solve is you want to find a behavioral policy. So you want to learn what action to take given a certain screen image. So each of those little agents, given the current screen image, creates a preference over the different available actions given its specific goal. So each agent has a different goal.

And a goal can be something like this: in the case of Ms. Pac-Man, you have all these pellets on the screen. So a specific goal would be to go to one particular pellet. So the agent that is responsible for that, whose expertise is going to that pellet, tries to create a preference over the current actions, and so the action that brings it as quickly as possible to this pellet will have the highest preference. And so all of these agents, and you have more than 150 of those, they all communicate their preferences through Q values to a top agent, and then the top agent kind of combines all these preferences into a single action.

In the combination, it looks not just at how many agents want to go in a certain direction, but also how important a particular agent is. To put it differently, how badly a particular agent wants to take that action. And so certain actions, such as going to a pellet, are less important than trying to avoid a ghost. Because if you run into a ghost, you die, which is very bad. So the agent that doesn’t want to run into a ghost, its preference is much, much stronger for trying to avoid that ghost than an agent responsible for going to a pellet. So the top agent looks at the number of agents that want to go in a certain direction and also how important each agent is.
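
The combination step described in this answer can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: it assumes the top agent simply adds up the Q values each agent reports per action and picks the action with the largest total, and the numbers, the four-action set and the name `aggregate` are made up for the example.

```python
# Hypothetical action set; the real game's action set may differ.
ACTIONS = ["up", "down", "left", "right"]


def aggregate(agent_q_values):
    """Sum every agent's Q values per action and return the preferred action.

    agent_q_values: list of dicts mapping action -> Q value, one dict per agent.
    Actions an agent has no opinion on count as 0.
    """
    totals = {a: 0.0 for a in ACTIONS}
    for q in agent_q_values:
        for a in ACTIONS:
            totals[a] += q.get(a, 0.0)
    best = max(ACTIONS, key=lambda a: totals[a])
    return best, totals


# Three pellet agents mildly prefer "left" (made-up values) ...
pellet_agents = [{"left": 1.0, "up": 0.5}] * 3
# ... but one ghost agent strongly objects to "left" and mildly to "up".
ghost_agent = {"left": -10.0, "up": -1.0, "down": 0.0, "right": 0.0}

action, totals = aggregate(pellet_agents + [ghost_agent])
```

With these numbers the three pellet agents are outvoted by the single ghost agent: "left" totals -7.0, so the top agent picks "up" (total 0.5). Summing captures both the number of agents and how strongly each one cares.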

Host:       So number and intensity of the recommendations.

Harm van Seijen:       Yeah. That’s right.

Host:       So the “I see dead people” agent is going to get more attention than the “I see food pellet” agent.

Harm van Seijen:       It’s going to get more attention if the ghost is very close. If it’s far away, then it doesn’t matter that much. So then its preference is not that strong.

Host:       Oh, okay. So that makes a difference. The ghost could be right around the corner… or the ghost could be…

Harm van Seijen: Then it is very important to take the right action. And then it has a very strong intensity. Whereas if the ghost is far away, then its intensity is much lower, and then maybe a fruit that is close has a higher intensity.
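
One way to picture this distance effect is to let each agent's intensity shrink with the number of steps to its object. The sketch below assumes a simple discount of 0.9 per step, which mirrors how discounted Q values behave but is not taken from the conversation; the magnitudes, distances and the name `intensity` are hypothetical.

```python
def intensity(magnitude, distance, discount=0.9):
    """Strength of an agent's preference: its stake, discounted per step of distance."""
    return magnitude * discount ** distance


# Made-up stakes: dying to a ghost matters far more than eating a fruit.
GHOST_PENALTY = 1000.0
FRUIT_REWARD = 200.0

near_ghost = intensity(GHOST_PENALTY, distance=2)    # ghost right around the corner
far_ghost = intensity(GHOST_PENALTY, distance=30)    # ghost on the other side of the maze
near_fruit = intensity(FRUIT_REWARD, distance=3)     # fruit a few steps away
```

Under these assumptions the nearby ghost dominates everything, while the distant ghost carries less weight than the nearby fruit, which matches the ordering described in the answer.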

Host:       So on that same topic, in the paper on HRA and reinforcement learning, you said the best results were achieved when the AI agent acted egotistically or selfishly and left it to the top aggregating agent to make the best move. I mean, seriously, that sounds like my family when we were growing up. It’s the kids – each arguing for their own case and then Dad finally saying okay, Gretchen you get it…

Harm van Seijen:       Well yes. So, you can imagine if you have different experts that you want each expert to be really good at its particular job. So it’s really only going to care about its particular job for example. And then it’s the top agent that listens to all of those things and makes the final decision.

Host:       Let’s talk about that for a second because that, to me… We alluded to it earlier kind of this divide and conquer, and this is kind of a new thing in terms of breaking big problems down into smaller sections, assigning them and then sort of tackling the problems that way. How is that faring in research that you are doing and in any others that are kind of moving in that direction as well?

Harm van Seijen:       So there is quite some research in trying to solve a problem using multiple agents. It’s always about breaking down a problem. But it is very challenging to find an architecture that can do that in a good way. So to find a good policy, you can learn it in a stable way and efficiently. So you can build many different architectures that somehow break up a problem. But to find like a good one is actually very challenging. So that’s also what we spend most of our time with on this research was finding the right way to have these agents work together.

Host:       So, if Ms. Pac-Man was kind of a big achievement, and it really was, I mean it’s hard, hard, hard game. With what you did to conquer that game, where could you see it being used in the real world? Where might you try it in a more, sort of, practical application?

Harm van Seijen:       Yeah, I mean, obviously the real world is extremely complex. So anything in the real world, you want to kind of break it down. So this particular technique learning very intelligent behaviors, I mean, you can think of for example a really smart Cortana for example, a really proactive personal agent.  Because if you interact with that agent, you have to take into account actions. You have to trade off immediate versus the future, things that happened in the future, things that happened immediately. I would say the real world is really complex and you want to break it down.

Host:       So, it has transferability… just how you bring that into these other scenarios is maybe one of the next steps in the research?

Harm van Seijen:       That’s right.

Host: The other question – we talked about this before and I talked about it with quite a few people in your field. This Hybrid Reward Architecture and reinforcement learning is considered an important step on the path towards artificial general intelligence… as opposed to artificial intelligence, which is present in a lot of things as we know it now. Talk a little bit about the difference between those two concepts.

Harm van Seijen:       Right. So the difference is really about being good in one particular problem, being really good in one particular thing versus being good in many different things at the same time and kind of combining things with each other. So humans have very good general intelligence, so they can do many different tasks. Versus right now, like a lot of AI is very specialized. So they are good in one particular thing, but nothing else. So the goal towards going to general AI is trying to create a system that just like humans can do many, many different things. It can easily switch between tasks.

Host:       Yeah. And one of the things I’ve read about the goal is machines that can think and reason and communicate with humans, like humans.

Harm van Seijen:       Yeah. That’s the ultimate dream, right? So if you can communicate with your computer for example, just in the same way that we are communicating now, it could make things so much easier because like the world becomes more and more complex and if you have like a device that can deal with the complexity, but at the same time you can interact with in a very easy way, that can be really powerful.

Host:       What are you working on right now?

Harm van Seijen:       So yeah, like I said, it’s trying to remove obstacles, trying to build better RL that, that can be applied to bigger problems. A lot has to do with scalability, RL works really great on some restricted instances. So those islands of tractability I talked about earlier, trying to increase those islands.

Host:       You’re working with a group of people.  Are there separate threads or I should say lines of inquiry that you guys are dividing and conquering on? Do you work with teams that try to work on a particular problem?

Harm van Seijen:       So if you talk about Microsoft Research Maluuba, then yes, we have different teams there working on different problems. So one team works on machine reading comprehension for example. Another team works on dialogue and then we have my team that works on reinforcement learning. So within like Microsoft Research Maluuba, we have different teams working on different things. Then within the reinforcement learning group, we also have a couple of projects that we’re focusing on. We try to set high goals and those have to be tackled by groups of people. You can’t solve them on your own. So we really try to think well about what are the important problems, what do we want to solve and then try to create interest among multiple people so we can actually make some progress there.

Host:       So outside of Microsoft Maluuba, this sort of broader research horizon, are there other interesting things going on with reinforcement learning?

Harm van Seijen:       So there are many active areas of research within reinforcement learning. There’s only a few islands of tractability that we can tackle right now. And those are things like exploration, efficient exploration, option discovery, representation learning, generalization. So there’s a whole range of different active areas of research. We are working on some of them. But not all.

Host:       Harm, what inspired you to get into this?

Harm van Seijen:       I have a background in applied physics. And if you look at physics as a field, then it’s a couple of hundred years old. If you compare artificial intelligence with physics, it’s a really new field. It’s maybe 50 years old or something. So it’s a really exciting area to do research in. It could have such a big impact on our society if you can actually well solve general AI for example. If it would solve it, it would completely change our society. But even if you make steps towards that, it can already have a really big impact. So it’s exciting in two ways: from a research perspective it’s exciting because it’s a new field compared to the different sciences and it can have a massive impact in the world. So those two things is what makes it really exciting for me.

Host:       And you were that kind of kid growing up, like what can I discover, how can I change the world?

Harm van Seijen:       I was always interested in research. So it took me a while before finding the right type of research I guess. But from the start, I’ve always been very “research” I guess.

Host:       Yeah? Even as a child?

Harm van Seijen:       Yes, absolutely.

Host:       What kind of research did you do when you were young?

Harm van Seijen:       For example, different kinds of puzzles. I was always interested in different kinds of puzzles. What you do now is kind of similar things, solving puzzles, but then much harder puzzles. So really I see what I do right now is kind of similar of what I did when I was 10 years old, just at a different level.

Host:       Exponentially different… Do you see a lot of talent coming up in universities, whether it’s in Europe or here or Canada that are ready to take the baton and run in this field?

Harm van Seijen:       Yeah, I mean, all across the world, I think AI is getting more and more popular at universities as well. I think here in Canada we’re really at the forefront. So we have some great universities here where some of these techniques, deep learning, but also reinforcement learning that came from the universities here in Canada. It feels like the right place to be.

Host:       It sounds like you are in the right place and we’re excited to watch as this field continues to grow and change the world. Harm, thanks for joining us today.

Harm van Seijen:       Yeah, thank you very much.

Host:       To find out more about Dr. Harm van Seijen, and the groundbreaking work going on at Microsoft Maluuba, visit

[End of recording]
