Microsoft Research Podcast

An ongoing series of conversations bringing you right up to the cutting edge of Microsoft Research.

Hybrid Reward Architecture and the Fall of Ms. Pac-Man with Dr. Harm van Seijen

December 6, 2017 | By Microsoft blog editor

Episode 3, December 6, 2017

If you’ve ever watched King of Kong: Fistful of Quarters, you know what a big deal it is to beat a video arcade game that was designed not to lose. Most humans can’t even come close. Enter Harm van Seijen and a team of machine learning researchers from Microsoft Maluuba in Montreal. They took on Ms. Pac-Man. And won. Today we’ll talk to Harm about his work in reinforcement learning, the inspiration for hybrid reward architecture, visit a few islands of tractability and get an inside look at the science behind the AI defeat of one of the most difficult video arcade games around.

To find out more about Harm van Seijen and the groundbreaking work going on at Microsoft Maluuba, visit


Podcast Transcript

…trying to mimic how the brain works, it’s more being inspired by how the brain works. This is true for neural networks, for example. They are also based on how our brain processes information. It doesn’t mean that it’s an exact copy of how the brain works. That’s not the goal. I mean, machines have different capabilities, so it’s not so much about trying to mimic exactly a human brain rather than being inspired.

You are listening to the Microsoft Research podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

If you’ve ever watched King of Kong: Fistful of Quarters, you know what a big deal it is to beat a video arcade game that was designed not to lose. Enter Dr. Harm van Seijen and a team of machine learning researchers from Microsoft Maluuba in Montreal. They took on Ms. Pac-Man and won. Today we’ll talk to Harm about his work in reinforcement learning, the inspiration for hybrid reward architecture, visit a few islands of tractability and get an inside look at the science behind the AI defeat of one of the most difficult video arcade games around. That and much more on this episode of the Microsoft Research podcast.

Harm, give our listeners a brief description of the kind of work you do.

Respondent:       I work on reinforcement learning which is a certain class of machine learning that focuses on learning good behaviors. So it’s a very powerful method. You can use it in many different instances. But there’s still a lot of research to make sure that it can be applied in a broader setting. So we are working on those challenges to remove those obstacles so it can be applied in a very broad way.

Moderator:       Why is machine learning and particularly reinforcement learning such an important field in artificial intelligence?

Respondent:       So with more and more data, you can build more and more complex systems. When your systems become more complex, at a certain moment you can no longer code everything by hand; you want the system to learn automatically. Take classifying images, for example: building a classifier that can recognize certain objects would be very complex if you did it by hand. With machine learning, you can learn it automatically, so it helps you build a very complex classifier that you couldn’t otherwise encode by hand. With reinforcement learning, it’s a similar thing, but it’s about behaviors, and behaviors are about taking actions. For example, it’s used in AlphaGo: you are able to build a Go player that is much stronger than humans. Because it learns automatically, you can build something that is much better than humans.

Moderator:       One of your big achievements recently is beating the game of Ms. Pac-Man. Why are video games so suitable for exploring artificial intelligence?

Respondent:       Well, they are suitable because they give you a very controlled environment to test certain ideas. If you are dealing with applications in the real world, you have to tackle the full complexity of your problem all at once. Whereas in a game, you can play with how complex you make your problem. Also, you can run it faster than real time, for example, so you get a very quick turnaround time for building algorithms.

Moderator:       So you are running the game at a faster speed than the game normally goes?

Respondent:       Yeah, exactly. If we take Go as an example: if you play Go in real time, it’s a very slow game, and maybe a game lasts an hour or two hours. If you play it in an artificial environment, you can run it much faster, so you can play many, many games in the same amount of time that you would play a single game in real time. So yeah, it can give you a big speed-up in that scenario.

Moderator:       But Ms. Pac-Man, for example, which is a different kind of game obviously, is a video game that moves super fast to begin with. Is the same sort of incremental speed increase relevant there?

Respondent:       Yeah, I mean, because you always want to go faster. Ms. Pac-Man we can run much faster than real time. We can run it maybe 30 or 40 times as fast as real time. So it means that for your total computation time, if in real time it would take a month, at 30 times as fast it would take you a single day. So it makes a big difference.
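The arithmetic in this answer can be checked with a minimal sketch. The function name and the choice of a 30-day month are illustrative assumptions, not details from the conversation.

```python
def wall_clock_days(real_time_days: float, speedup: float) -> float:
    """Wall-clock days needed when the game runs `speedup` times faster than real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return real_time_days / speedup


# Assumption: "a month" of real-time play is taken as 30 days.
for speedup in (30, 40):
    print(f"{speedup}x faster: {wall_clock_days(30, speedup):.2f} days")
```

At a 30x speed-up, the assumed 30 days of play fit into one day of computation, which matches the figure quoted above.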

Moderator:       This is where machines are more capable than we are currently in terms of speed of processing and calculation and things like that.

Respondent:       Yes. It’s interesting because if you look at games, there are a couple of aspects where machines really have an advantage and a couple of aspects where humans have an advantage. If you have a game that is challenging mainly because it requires a very fast reaction time, then that would be a game where machines naturally have an advantage. On the other hand, you have games that really require long-term reasoning, and those are games that are very suitable for humans but much harder for machines.

Moderator:       I’m wondering if we can make them feel inferior, but I don’t think they have those kinds of feelings at the current time.

Respondent:       Not yet, no.

Moderator:       That’s what we’re working for, right? Listen, as a level set, talk to me a little bit about the broader world of machine learning right now and differentiate some of the concepts.

Respondent:       Within machine learning you have different problems, basically. The three big ones are supervised learning, unsupervised learning and reinforcement learning. Those are really problem definitions. For example, reinforcement learning tackles the problem of an agent that interacts with an environment. Compare that with deep learning: deep learning is not so much a problem definition as it is a technique. It’s a particular technique to do function approximation, in particular by using neural networks with many layers. So it’s a technique you can use on these different problem instances. If you combine deep learning with reinforcement learning, you get something called deep reinforcement learning, and it just means that you are tackling the problem of reinforcement learning using a function approximation that uses deep learning.

Moderator:       What is hot right now? Are there areas in machine learning that are really interesting to a lot of people?

Respondent:       Yes. Deep learning really had a big boom a couple of years ago, so that’s super hot right now. It has a long history: in the 80s it was also popular, but then it kind of died down again. The most recent boom was a couple of years ago, when they discovered how you could build much deeper and much more powerful networks. I think reinforcement learning is just on the brink of breaking through. In the last two years, a lot of companies have become very interested in reinforcement learning as well. So I think that’s the next big thing.

Moderator:       The way you say it makes me think that the booms come when a particular researcher, or a group of researchers, or even inventors if you will, make a breakthrough and then everyone pays attention. It’s like, hey, that’s new, that’s interesting, let’s go on that thread. Where are you with reinforcement learning in that process? Are you still in kind of the research breakthrough phase?

Respondent:       In terms of maturity, I think it’s much less mature. It’s still more in the research phase than something like deep learning. There are still a lot of problem instances that we cannot solve yet, or not solve well yet. But there are a couple of islands of tractability: certain problem instances within reinforcement learning that we can solve. In particular, consider the bandit problem, which is a special case of a reinforcement learning problem. That is one that we can do very well, and it applies, for example, in ad placement. Showing ads on a website can be modeled as a bandit problem, so it’s being used there in real products. So there is some subset of reinforcement learning that we can already do well and use in real products. But for the most part, it’s still a research effort.
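To make the ad placement example concrete, here is a minimal sketch of one common strategy for this kind of problem, epsilon-greedy selection. The click rates, function name and parameter values are hypothetical, and nothing in the conversation says which method real products use.

```python
import random


def run_epsilon_greedy(click_rates, steps=10_000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy ad selection against hypothetical click-through rates."""
    rng = random.Random(seed)
    n_ads = len(click_rates)
    shows = [0] * n_ads        # how many times each ad was shown
    estimates = [0.0] * n_ads  # running average click rate observed per ad
    for _ in range(steps):
        if rng.random() < epsilon:
            ad = rng.randrange(n_ads)  # explore: show a random ad
        else:
            ad = max(range(n_ads), key=lambda a: estimates[a])  # exploit: best estimate so far
        reward = 1.0 if rng.random() < click_rates[ad] else 0.0  # simulated click
        shows[ad] += 1
        estimates[ad] += (reward - estimates[ad]) / shows[ad]  # incremental mean update
    return shows, estimates


# Hypothetical true click rates for three ads; the learner never sees these directly.
shows, estimates = run_epsilon_greedy([0.02, 0.05, 0.11])
print(shows, [round(e, 3) for e in estimates])
```

Each choice here affects only the immediate reward (click or no click), with no longer-term consequences to plan for. That is what makes this setting a special case of the general reinforcement learning problem.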

Moderator:       Right. Not infancy necessarily, but certainly not mature.

Respondent:       Yes, absolutely.

Moderator:       Did you say islands of tractability?

Respondent:       Yes. So just certain problem instances that we have a good handle on.

Moderator:       Okay, good. I think I’ve been stuck on an island of intractability before, but.

When we talked before, you said your work encompasses computer science and neuroscience. In essence you are drawing on theories of how the human brain works. How are you applying these theories to your work with machines?

Respondent:       I would see it more as being inspired by how the brain works rather than trying to mimic it. This is true for neural networks, for example. They are also based on how our brain processes information. It doesn’t mean that it’s an exact copy of how the brain works. That’s not the goal. Machines have different capabilities, so it’s not so much about trying to mimic exactly the human brain as about being inspired. That also holds for certain algorithms in reinforcement learning. It’s more about being inspired by how we think about decision making in humans than about trying to make an exact copy of that.

Moderator:       That’s interesting. Speaking of inspiration, how did you come up with the idea of hybrid reward architecture? What was the inspiration behind that?

Respondent:       The inspiration really came from how humans cooperate to build great products. For example, a smartphone is a great piece of technology. Many, many people were involved in building it, and there is not really a single person who knows how to make a smartphone. It’s really the group of people, each with their own expertise, that knows how to make a smartphone. So we wanted to build something similar: if you are trying to solve a very complex task, have a bunch of different artificial agents that work together, where each agent is focused on a different aspect of the task. So each agent has a different expertise. Then, by combining those agents in a particular way, they show an overall behavior that is very intelligent.

Moderator:       So it’s kind of the distributed expertise model of business as it were.

Respondent:       Right.

Moderator:       But only with artificial intelligence agents within a program?

Respondent:       Yes, that’s right.

Moderator:       So let’s talk about that for a second. When we were talking about HRA before, you mentioned the importance of this: you’ve got these agents acting individually, and you program them to do specific tasks. But then there’s a necessity, like there is in a business, for a boss to make a decision. So explain how that works within this hybrid reward architecture (HRA), particularly with the Ms. Pac-Man task.

Respondent:       We want all those agents to only care about their specific problem and not worry about how to collaborate with other agents. So we have a hierarchical structure where at the bottom you have all those little agents that each take care of the little problem they need to solve. In the case of reinforcement learning, in the case of Ms. Pac-Man, the problem that you ultimately want to solve is to find a behavioral policy: you want to learn what action to take given a certain screen image. So each of those little agents, given the current screen image and its specific goal, computes a preference over the different available actions. Each agent has a different goal. In the case of Ms. Pac-Man you have all these pellets on the screen, so a specific goal would be to go to one particular pellet. The agent that is responsible for that, whose expertise is going to that pellet, creates a preference over the current actions, and the action that brings it to the pellet as quickly as possible will have the highest preference. All of these agents, and you have more than 150 of those, communicate their preferences through Q-values to a top agent, and the top agent then combines all these preferences into a single action. In the combination, it looks not just at how many agents want to go in a certain direction, but also at how important a particular agent is. To put it differently, how badly a particular agent wants to take that action. Certain actions, such as going to a pellet, are less important than trying to avoid a ghost, because if you run into a ghost, you die, which is very bad. So the agent that doesn’t want to run into a ghost has a much, much stronger preference for avoiding that ghost than an agent responsible for going to a pellet.
So the top agent looks at the number of agents that want to go in a certain direction and also at how important each agent is.
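The combination step described here can be illustrated with a small sketch. It assumes the top agent simply sums each agent’s Q-values per action and picks the action with the largest total. The agent names, action list and numbers are hypothetical, and the actual system may weight or normalize preferences differently.

```python
ACTIONS = ["up", "down", "left", "right"]


def aggregate(agent_q_values):
    """Sum every agent's Q-values per action; return the totals and the best action."""
    totals = [0.0] * len(ACTIONS)
    for q_values in agent_q_values.values():
        for i, q in enumerate(q_values):
            totals[i] += q
    best = ACTIONS[max(range(len(ACTIONS)), key=lambda i: totals[i])]
    return totals, best


# Hypothetical preferences: three pellet agents mildly prefer "left",
# while one ghost agent strongly objects to "left" because a ghost is close.
agent_q_values = {
    "pellet_1": [0.1, 0.0, 0.9, 0.2],
    "pellet_2": [0.0, 0.1, 0.8, 0.3],
    "pellet_3": [0.2, 0.0, 0.7, 0.1],
    "ghost_1": [0.0, 0.0, -10.0, 0.0],
}
totals, best = aggregate(agent_q_values)
print(totals, best)
```

In this toy setup three agents prefer "left", but the single ghost agent’s strong negative value outweighs them, so the summed preference favors "right". This reflects both the number of agents and the intensity of each preference.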

Moderator:       So number and intensity of the recommendations.

Respondent:       That’s right.

Moderator:       So the “I see dead people” agent is going to get more attention than the “I see food pellet” agent.

Respondent:       It’s going to get more attention if the ghost is very close. If it’s far away, then it doesn’t matter that much, so its preference is not that strong.

Moderator:       Oh okay. So that makes a difference. The ghost could be right around the corner.

Respondent:       Then it is very important to take the right action, and it has a very strong intensity. Whereas if the ghost is far away, then its intensity is much lower. Then maybe * as close as a higher intensity.

Moderator:       So on that same topic, in the paper on HRA and reinforcement learning, you said the best results were achieved when the AI agent acted egotistically or selfishly and left it to the top aggregating agent to make the best move. Seriously, that sounds like my family when we were growing up. It’s the kids – each arguing for their own case and then Dad finally saying okay, Gretchen you get it.

Respondent:       Well yes. You can imagine if you have different experts that you want each expert to be really good at its particular job. So it’s really only going to care about its particular job for example. Then it’s the top agent that listens to all of those things and makes the final decision.

Moderator:       Let’s talk about that for a second. We alluded earlier to this kind of divide and conquer. This is kind of a new thing in terms of breaking big problems down into smaller sections, assigning them and then tackling the problems that way. How is that faring in the research that you are doing, and in any other research that is moving in that direction as well?

Respondent:       So * some research in trying to solve a problem using multiple agents. It’s always about breaking down a problem. But it’s very challenging to find an architecture that can do that in a good way, so that you find a good policy and can learn it stably and efficiently. You can build many different architectures that somehow break up a problem, but finding a good one is actually very challenging. That’s also what we spent most of our time on in this research: finding the right way to have these agents work together.

Moderator:       So Ms. Pac-Man was kind of a big achievement, and it really was. I mean, it’s a hard, hard, hard game. With what you did to conquer that game, where could you see it being used in the real world? Where might you try it in a more practical application?

Respondent:       Obviously the real world is extremely complex. So anything in the real world, you want to kind of break it down. So this particular technique learning very intelligent behaviors, I mean, you can think of for example a really smart Cortana for example, a really proactive personal agent.  Because if you interact with an agent, you have to take into account actions. You have to trade off immediate versus the future, things that happened in the future, things that happened immediately. I would say the real world is really complex and you want to break it down.

Moderator:       So it has transferability; just how you bring that into these other scenarios is maybe one of the next steps in the research.

Respondent:       That’s right.

Moderator:       The other question: we talked about this before, and I’ve talked about it with quite a few people in your field. This hybrid reward architecture and reinforcement learning is considered an important step on the path towards artificial general intelligence, as opposed to artificial intelligence, which is present in a lot of things as we know it now. Talk a little bit about the difference between those two concepts.

Respondent:       The difference is really about being really good at one particular thing versus being good at many different things at the same time and combining them with each other. Humans have very good general intelligence, so they can do many different tasks. Right now, a lot of AI is very specialized: it is good at one particular thing, but nothing else. So the goal in going towards general AI is to create systems that, just like humans, can do many, many different things and can easily switch between tasks.

Moderator:       One of the things I’ve read about the goal is machines that can think and reason and communicate with humans like humans.

Respondent:       Yes. That’s the ultimate dream, right? So if you can communicate with your computer for example, just in the same way that we are communicating now. It could make things so much easier because like the world becomes more and more complex and if you have like a device that can deal with the complexity, but at the same time you can interact with in a very easy way, that can be really powerful.

Moderator:       What are you working on right now?

Respondent:       So like I said, it’s trying to remove obstacles, trying to build better RL methods that can be applied to bigger problems. A lot has to do with scalability: RL works really well on some restricted instances. So those islands of tractability I talked about earlier, we are trying to enlarge those islands.

Moderator:       You’re working with a group of people. Are there separate threads or, I should say, lines of inquiry that you guys are dividing and conquering on? Do you work with teams that try to work on a particular problem?

Respondent:       If you talk about Microsoft Research Maluuba, then yes, we have different teams there working on different problems. One team works on machine reading comprehension, for example, another team works on dialogue, and then we have my team, which works on reinforcement learning. Then within the reinforcement learning group, we also have a couple of projects that we’re focusing on. We try to set high goals, and those have to be tackled by groups of people. You can’t solve them on your own. So we really try to think about what the important problems are and what we want to solve, and then try to create interest among multiple people so we can make some progress there.

Moderator:       So outside of Microsoft Maluuba, on this sort of broader research horizon, are there other interesting things going on with reinforcement learning?

Respondent:       There are many active areas of research within reinforcement learning. There are only a few islands of tractability that we can tackle right now. The active areas are things like exploration, efficient exploration, option discovery, representation learning and generalization. So there’s a whole range of different active areas of research. We are working on some of them.

Moderator:       Harm, what inspired you to get into this?

Respondent:       I have a background in applied physics. If you look at physics as a field, it’s a couple of hundred years old. If you compare artificial intelligence with physics, it’s a really new field. It’s maybe 50 years old or something. So it’s a really exciting area to do research in. It could have such a big impact on our society if you can actually solve general AI, for example. If we solved it, it would completely change our society. But even if you make steps towards that, it can already have a really big impact. So it’s exciting in two ways. From a research perspective it’s exciting because it’s a new field compared to the other sciences, and it can have a massive impact on the world. Those two things are what make it really exciting for me.

Moderator:       And you were that kind of kid growing up, like what can I discover, how can I change the world?

Respondent:       I was always interested in research. So it took me a while before finding the right type of research I guess. But from the start, I’ve always been very researchy I guess.

Moderator:       Even as a child?

Respondent:       Yes, absolutely.

Moderator:       What kind of research did you do when you were young?

Respondent:       For example, different kinds of puzzles. I was always interested in different kinds of puzzles. What I do now is kind of similar: solving puzzles, but much harder puzzles. So I see what I do right now as kind of similar to what I did when I was 10 years old, just at a different level.

Moderator:       Do you see a lot of talent coming up in universities, whether it’s in Europe or here or Canada that are ready to take the baton and run in this field?

Respondent:       Yeah, I mean, all across the world, I think AI is getting more and more popular at universities as well. I think here in Canada we’re really at the forefront. We have some great universities here, and some of these techniques, deep learning but also reinforcement learning, came from the universities here in Canada. It feels like the right place to be.

Moderator:       It sounds like you are in the right place. We’re excited to watch as this field continues to grow and change the world. Harm, thanks for joining us today.

Respondent:       Thank you very much.

[End of recording]