Provably efficient reinforcement learning with Dr. Akshay Krishnamurthy


head shot of Dr. Akshay Krishnamurthy for the Microsoft Research podcast

Episode 117 | June 3, 2020




Akshay Krishnamurthy: When you have casual conversations with people, you typically emphasize the, like, final performance and when you talk about these high-profile achievements, we don’t often spend much time scrutinizing the sample efficiency. And one thing that we’ve been kicking around a little bit is actually, can we come up with these empirical test suites or benchmarks that really put data efficiency at the forefront?

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: MSR’s New York City lab is home to some of the best reinforcement learning research on the planet but if you ask any of the researchers, they’ll tell you they’re very interested in getting it out of the lab and into the real world. One of those researchers is Dr. Akshay Krishnamurthy and today, he explains how his work on feedback-driven data collection and provably efficient reinforcement learning algorithms is helping to move the RL needle in the real-world direction. That and much more on this episode of the Microsoft Research Podcast.

Host: Akshay Krishnamurthy, welcome to the podcast.

Akshay Krishnamurthy: Thanks for having me.

Host: Let’s start by situating. You’re a Principal Researcher at MSR in New York City and part of what I’d like to call a small but tactical unit working on reinforcement learning in real world settings, both for the short-term, you called that translational, and the long-term. So, to kick things off today, tell us how these two approaches are different, and how they are being tackled in your lab, and then talk about how they are each making an impact on the science of reinforcement learning, both practically and theoretically.

Akshay Krishnamurthy: I like to categorize our projects in terms of the time horizon with which we might push things into a product. And the first, kind of, bucket here is more translational efforts. These are the kinds of things that maybe we hope to put in production in the next year or two years. And then the other bucket is maybe more blue-sky kind of efforts, things that maybe are five years out, or ten years out, or you know, just more high-risk kind of projects. Let me talk a little bit about the translational stuff.

Host: Yeah.

Akshay Krishnamurthy: So, our group maintains and develops this RL system that is currently deployed in production in several settings. Most of these are recommendation kind of settings.

Host: Mmm-hmm.

Akshay Krishnamurthy: And we’re constantly adding features. We have this amazing group of engineers and data scientists working on this product and the researchers are adding features to the system based on new research and also, kind of based on requests and feedback from product groups. So this is I think a really cool way to do research where you’re working with a product group on some deployments and then they come to you and say hey, we also want to do X, and then we think about it and we’re like oh, we actually don’t know how to do X. So, then we, like, go do some research and try and figure out how to do that kind of thing and then we circle back with the product group maybe in six months or one year when we figure out something.

Host: Okay.

Akshay Krishnamurthy: And one example of this is actually this project we did on recommendation when the sort of decision space that the algorithm is operating over is very, very rich and typically, like, of a combinatorial nature. So, the prototypical example here is personalized news. So, you don’t just choose like one story to show to the user, you might choose like ten stories and you have to rank them. So now your decision space is much, much more complicated and it becomes much more difficult, statistically, to figure out what sort of recommendation rules are better. And so we sort of came up with some technique that’s inspired by a lot of literature in the learning theory community and we brought this back to the news team, and I think, in the last one year, this has sort of been making its way to production in this system that we’re running. So that’s kind of like what the translational efforts look like. And the other side of the story is this blue sky or long-term stuff and here, I think, we’re not that attached to applications when I think about this style of research. I mostly am thinking about, what are the really challenging problems in reinforcement learning and how can we come up with models and algorithms that address those issues? I don’t really know the timescale that these things will pan out, right? I think these problems are super hard. I’m definitely not the only one thinking about them. You know we’re making progress, but it’s just not exactly clear to me when these things will make it to applications, but the hope is that, you know, as we build our understanding, we’ll be able to do that.

Host: Well let’s situate you, Akshay. Researchers often live in an intersection, or they like to say they do. So, what’s your address? Where does your research passion lie and what gets you up in the morning?

Akshay Krishnamurthy: Yeah, so these days I’m pretty interested in the statistical issues around reinforcement learning. But this, I think for me, is part of a bigger story. I’ve always been interested in interactive learning which, typically, when I tell this to people, they think of something else, but I usually use this to refer to any sort of learning setting where the algorithm engages in feedback-driven data collection. And this is, I think, one of the central problems in reinforcement learning. The agent operates in this world, makes a bunch of decisions, and the decisions it makes influence the data that the agent has. And so, part of the challenge is actually learning to make the right decisions to collect the right data. So that’s sort of where I came from. I’ve always been interested in this sort of, like, broader topic of designing algorithms to collect the right data. And the last four or five years, it’s mostly been about reinforcement learning because that’s a very hard version of that problem.

Host: Well let’s go a little deeper on the theoretical for a moment and talk about what you referred to as incentive structures in scientific circles. So, what does the current landscape look like, particularly with regards to how benchmarks are set up and measured, and how we might be thinking differently about incentive structures so that the reinforcement learning ball moves forward?

Akshay Krishnamurthy: I think data efficiency is one of the core issues in RL. I was just mentioning this. And so, I think most researchers are interested in moving towards real world applications. And the ones that people typically kick around are recommendation, high-precision medicine, personalized medicine, robotics, these kinds of things. And in these settings, we may not have the luxury of collecting millions or billions of interactions before we produce some high-quality policy. So, I feel the real-world applications are quite different from the kinds of simulated environments that we are currently using for benchmarking. And, don’t get me wrong, the benchmarks are amazing. They’ve really pushed the field forward. But I am also kind of mindful about the way benchmarks influence the community, so there’s some kind of feedback loop here.

Host: Right.

Akshay Krishnamurthy: And so, these benchmarks kind of overemphasize the quality of the final policy at the expense of, how much experience did you need to get there? And I think, when you have casual conversations with people, you typically emphasize the, like, final performance and when you talk about these high-profile achievements, we don’t often spend much time scrutinizing the sample efficiency. And one thing that we’ve been kicking around a little bit is actually, can we come up with these empirical test suites or benchmarks that really put data efficiency at the forefront?

Host: Right.

Akshay Krishnamurthy: And I think these are kind of things that hopefully will catch on and actually like nudge the community back to thinking about statistical issues.

Host: Right. Let’s get down to the nitty gritty and talk about math. The beating heart of RL research is algorithms and you told me they need to be able to do three things. So, I am going to have you get down to the nitty gritty on some impressive algorithms you’ve been working on in a minute. But first, give us a framework for the research. What things are reinforcement learning algorithms supposed to be able to do, and where are we on the path towards “cheap, fast, good, pick three” instead of “pick two?”

Akshay Krishnamurthy: This is actually something I took from one of John Langford’s talks, and the core capabilities of RL algorithms are credit assignment, which is when you make a sequence of decisions and observe a certain outcome, you know how to allocate, like, which decisions were responsible for that outcome. And this is not so easy because maybe you did one hundred things and then at the end, you know, you are at a pot of gold and is it because you turned left over here or turned right over there or whatever? That’s credit assignment. The other issue is exploration. So, you’re an agent, you’re in some complicated environment. You run around for some time. If you run around blindly, you will not, you know, find every nook and cranny that you might need to, you know, solve your problem, so you have to explore the environment. And the other one is generalization. If your environment is relatively complicated, you might not actually see the same thing twice. And what you’d hope for is that the skills you learn and the patterns you pick up in one place can transfer to other situations. So, this is this thing that my colleague Dipendra always talks about. He’s saying, I learned how to bike in Brooklyn and when I go to Amsterdam, I should also know how to bike. So, we should be able to generalize our experience to new situations.

Host: Okay.

Akshay Krishnamurthy: So those are the three capabilities and when you take any pair of these right now, we have relatively mature algorithms that often are used in production, and we have like a pretty solid theory for how to do two of three. So, I think the one that is pretty common these days is doing credit assignment and generalization without exploration.

Host: Hmm.

Akshay Krishnamurthy: And so, this is what’s going on in a lot of these policy search kind of algorithms, like maybe you’ve heard of policy gradient methods. And so, what these algorithms do is, you have some current policy. You execute it perhaps with some noise to sort of deviate a little bit. And you pick up some signal about what local changes to your policy are doing, and this is sort of solving the credit assignment. This deviation is addressing the credit assignment aspect. And your policy is maybe parameterized by some neural network and so that’s handling the generalization aspect. You don’t have to keep track of every single state you might see. So, then you do these local deviations and you make small improvements on your policy. So, these algorithms are actually pretty effective and they work really well in, like, continuous control kind of problems but they kind of are doing a local search and so they don’t really work very well in problems where you have to do very sophisticated exploration. But maybe one research question is, can we bring exploration into this style of algorithms? We certainly don’t have robust algorithms for doing all three. The blue-sky optimism is that we can do all three.
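The policy gradient loop described here can be sketched in a few lines. This is a minimal REINFORCE-style toy on a two-armed bandit, not any algorithm from the papers discussed; all names and numbers are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_bandit(means=(0.0, 1.0), steps=2000, lr=0.1, seed=0):
    """Toy policy gradient (REINFORCE) on a 2-armed Gaussian bandit.

    The policy is a softmax over per-arm preferences. Each step we
    sample an action (the 'noisy execution'), observe a reward, and
    nudge the preferences in the direction that made the rewarding
    action more likely: local credit assignment, no explicit exploration.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(means))   # per-arm preferences (the "parameters")
    baseline = 0.0                 # running average reward, reduces variance
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choice(len(means), p=probs)
        r = means[a] + rng.normal(0, 0.1)
        # gradient of log softmax probability: one-hot(a) - probs
        grad = -probs
        grad[a] += 1.0
        theta += lr * (r - baseline) * grad
        baseline += 0.05 * (r - baseline)
    return softmax(theta)

probs = reinforce_bandit()
```

After training, the policy concentrates on the higher-mean arm, which is exactly the "small local improvements" behavior: effective when reward signal is nearby, but with nothing that drives deliberate exploration.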

Host: Yeah. 

Akshay Krishnamurthy: And I think even in this “cheap, fast and good” the definitions keep moving. So certainly, I think now, like if you think about with the advent of the internet, we can access information very cheaply, very quickly and it’s very high-quality.

Host: Yeah. 

Akshay Krishnamurthy: And it’s very different from maybe what it was thirty years ago or something.

Host: Absolutely.

Akshay Krishnamurthy: So, the definitions keep moving and that’s fine. But I think we should actually be optimistic to push for these things that maybe seem almost impossible.

(music plays)

Host: Well let’s talk about some specific algorithms since we’re on that track right now. The first one is one you call Policy Cover by Inductive Decoding, or PCID, and the exploration around this work was published in 2019 in an ICML paper titled – ready? – Provably efficient reinforcement learning with rich observations. Tell us about PCID and how it moved the needle on provably data-efficient methods. What were the key claims and contributions of this work?

Akshay Krishnamurthy: This paper is part of a longer effort on this sort of blue-sky track of designing algorithms with these three capabilities of credit assignment, generalization and exploration. So, in an ICML 2017 paper, we came up with a more general theory for some sort of way to build algorithms with these three capabilities and some, kind of, environment assumptions under which these things are possible. But the algorithms in that paper are not particularly practical, so this ICML 2019 paper is more focused towards practicality, although I would still consider this a relatively theoretical paper. So, what did we do? So, we have this new algorithm, it’s called PCID, and we can use it to find near optimal policies for a restricted class of environments that we call Block MDPs. And it does this in a provably efficient way, both in terms of the amount of experience it requires, and the total computation. And I think one thing I do want to highlight here is that these Block MDPs, they’re relatively structured environments, but they do actually require the agent to demonstrate all three of those core capabilities: credit assignment, generalization and exploration. But the other thing I want to highlight is the phrase “provably efficient” that I mentioned. I don’t want to suggest that other algorithms do not have these capabilities, what I’m just saying is, we are able to prove that our algorithm does. Perhaps the key claim is, we’re proving that this algorithm has some data efficiency guarantee and that is something I kind of want to emphasize. So, let me tell you, like, the high level of the approach. It’s superficially similar to, I think, what’s now a relatively common strategy which is to create some auxiliary supervision based on what the agent is doing, and use that to identify some kind of compact representation of the world and then use the compact representation to drive an exploration module.
So, the way we sort of realize this is, our representation is like a discrete latent state and that’s why this thing is called decoding. We decode the latent state of the world and then we run one of these algorithms that can do exploration and credit assignment, but not generalization, on the decoded latent state. So that’s like the high-level idea. And to explain more I have to tell you about this Block MDP model. The model is this kind of stylized setting in which there exists some latent state space that’s relatively small – in the literature they might call this a tabular state space – but these states are never observed by the agent. Instead, the agent will be in a state, but it doesn’t know that, and it will see an observation which is like sort of emitted from a distribution associated with this state, and the observation might be extremely high-dimensional and very, very rich. But the observation might actually have enough information for the agent to figure out what its latent state is. But the agent sort of doesn’t know how to do that compression. So, there’s some mapping in the world that takes observations and maps them to the latent state. And if the agent only knew this mapping, it could just run one of these tabular methods, which do exploration and credit assignment but don’t do generalization. So, this model kind of encourages a little bit of a, like, a factored approach where you do the generalization part, you know, to map observations to state, and then you do the other two things using a technique we already know.
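The Block MDP setup described here can be written down as a tiny toy simulator. This is a sketch of the model only, not any environment from the papers; the class name, emission scheme, and all numbers are made up for illustration:

```python
import numpy as np

class ToyBlockMDP:
    """Tiny Block MDP: a small latent (tabular) state space the agent
    never sees; each latent state emits a rich, high-dimensional
    observation. Every observation is decodable back to exactly one
    latent state (the 'block' property), but the agent is never
    handed the decoder.
    """
    def __init__(self, n_states=3, n_actions=2, obs_dim=50, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_states, self.n_actions = n_states, n_actions
        # latent transition tensor: T[s, a] is a distribution over next states
        self.T = self.rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        # one emission direction per latent state; observations are that
        # direction plus small noise, so states occupy disjoint "blocks"
        self.emit = self.rng.normal(size=(n_states, obs_dim))
        self.s = 0

    def _observe(self):
        return self.emit[self.s] + 0.1 * self.rng.normal(size=self.emit.shape[1])

    def reset(self):
        self.s = 0
        return self._observe()

    def step(self, a):
        self.s = self.rng.choice(self.n_states, p=self.T[self.s, a])
        return self._observe()

    def decode(self, obs):
        # ground-truth decoder the agent does NOT get to see:
        # nearest emission direction recovers the latent state
        return int(np.argmin(np.linalg.norm(self.emit - obs, axis=1)))

env = ToyBlockMDP()
obs = env.reset()
```

The point of the toy is the contract: the agent only ever receives `obs`, while `decode` exists in the world but is hidden, which is exactly the mapping PCID tries to learn.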

Host: Okay.

Akshay Krishnamurthy: The key challenge here is that you never observe the latent state. So, it’s not that clear, how are you going to learn this decoder? And this is sort of where the auxiliary supervision comes in. So, we use supervised learning to train a predictor to predict some other stuff, and somehow, we show that, when you predict this other stuff, you recover the latent state. The algorithm proceeds in this, like, forward manner: so, you take, like, a hundred steps and then you reset. And then you take a hundred steps and you reset. And each one of these is an episode. And so, we can assume we have the decoder at stage H, some time step, and we’re going to use the decoder at the previous time to learn the decoder at the current time. And the way you do it is, you take the observation at the current time and you predict the previous action and the previous latent state, or decoded state.

Host: Right.

Akshay Krishnamurthy: You can imagine it kind of recovers the dynamics of the world because the reason you’re seeing this observation now is because you were in that state yesterday and you took that action yesterday. So, you recover, like, the backward dynamics of the process. And because the dynamics are governed by the latent state only, you kind of learn this compression. And then, once you have the latent state at stage H, we learn how to visit every state that you could possibly reach at stage H, and then that gives you good data coverage. So that’s like the exploration component. You get good data coverage at stage H, and you use that to drive the process of learning the decoder at the next stage. So, the PCID supervision is, predict the previous learned state and action from the current observation. Okay, so what do we prove? We prove that, if you are in a Block MDP with not too many latent states, not too many actions per stage, and the time horizon is not too long, and then there’s some technical assumptions, this algorithm will find some way to visit every state of the world with good probability, without too many samples. So, the number of samples will be polynomial in the number of latent states, the time horizon, the number of actions, all these sort of relevant quantities, and then, we learn how to visit every state of the world, which is solving the exploration problem.
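The inductive supervision described here, predict the previous (decoded) state and previous action from the current observation, amounts to building a supervised dataset. This is a schematic sketch of that data construction only, not the paper's implementation; the function and variable names are invented:

```python
def pcid_supervision(episodes, decode_prev):
    """Build PCID-style auxiliary training data for one time step.

    episodes: list of trajectories, each a list of (observation, action)
              pairs, where the observation at index h follows the
              action taken at index h - 1.
    decode_prev: the already-learned decoder for the previous time step,
              mapping an observation to a predicted latent state id.

    Returns (inputs, labels): each current observation, labelled with
    the (decoded previous state, previous action) pair to predict.
    """
    inputs, labels = [], []
    for traj in episodes:
        for h in range(1, len(traj)):
            prev_obs, prev_action = traj[h - 1]
            cur_obs, _ = traj[h]
            inputs.append(cur_obs)
            labels.append((decode_prev(prev_obs), prev_action))
    return inputs, labels

# toy check: observations are ints, the "decoder" just halves them
episodes = [[(0, 'L'), (2, 'R'), (5, 'L')]]
X, y = pcid_supervision(episodes, decode_prev=lambda o: o // 2)
```

Any off-the-shelf classifier trained on `(X, y)` then plays the role of the new decoder; the inductive structure is simply that `decode_prev` comes from the previous round of this same construction.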

Host: Right.

Akshay Krishnamurthy: And the generalization component is baked in because this is a Block MDP, and you have to do a decoding. On the computational side, we operate in this sort of – it’s called an oracle model of computation – so we assume that you have a class of functions, maybe it’s like a neural network or something, and that you can solve supervised learning problems over this class of functions. And that’s where this auxiliary supervision comes in. We train a neural network, or whatever, to map the raw observation to the previous state and action and we assume that you can actually solve those classification problems. And so that’s the computation. Let me just mention the last piece which is the first one where we actually ran some experiments.

Host: Okay.

Akshay Krishnamurthy: Because this algorithm is not super practical, but not super impractical either, it’s somewhere in the middle.

Host: It’s the Goldilocks one.

Akshay Krishnamurthy: Yeah.

Host: Just right.

Akshay Krishnamurthy: It’s one that we felt like we could implement. So the experiments, I think, are more of a proof of concept that it is actually doing something sensible than anything else, and what we show is that we sort of concoct this kind of very, very difficult environment for exploration and we run these algorithms that don’t do any exploration and they fail, obviously, and then we run an algorithm that… we give it cheating access to the latent state and we run an exploratory algorithm. So, this is like that algorithm that does exploration and credit assignment, but not generalization, and it does extremely well. And then we run our algorithm which does, like, not much worse than that thing. Okay. So, it’s like the first one is, like, obviously not going to work. The second one is obviously going to work, and the question is, how close are we to, like, the skyline that we could hope to achieve? And so, we are getting there, but we’re solving that harder problem where you have to do the decoding.

Host: Okay. And that was 2019. We’re all the way into 2020. Since then, you haven’t sat around on your algorithms. You have a new paper in the pipeline with another algorithm that you think is actually pretty cool. The paper is called Kinematic state abstraction and provably – there’s that word again – efficient rich observation reinforcement learning. That’s a mouthful, and also a mathful. In this paper you present another algorithm called Homer. Tell us about Homer, and how it moves us closer to solving for X, if X equals robust, real world reinforcement learning. And again, feel free to show your work.

Akshay Krishnamurthy: Yeah, so Homer is… I would think of it as an improvement on PCID, but the high-level claims are kind of similar. So, we’re still working in a Block MDP. On a technical level, we’re actually able to remove some of these technical assumptions that I didn’t really explain to you in the previous story. But those are actually required by PCID’s theory and we can actually show that PCID empirically fails when some of those things are not satisfied. So that’s actually kind of important. But I think there are two, perhaps, more important points about Homer. So, Homer is much, much more robust in practice. And part of the reason is that, in PCID, we’re trying to use the current observation to predict the previous state and the previous action. Now we don’t have access to the previous state, so we predict the previous predicted state, okay? And this can cause some catastrophic failure when your previous predicted state is wrong, okay? So, you’re trying to predict something that you’re predicting yourself which causes some sort of bubbling of errors that doesn’t work very well. And so, in Homer, we use a different auxiliary supervision problem that does not rely on the previous one. We still do this kind of inductive thing, but we train this decoder for time H, and then we like throw it away, and then we train a completely new decoder for time H plus one. And so, this makes the algorithm somehow much, much more robust and much less prone to these, kind of, propagating errors.

Host: Okay.

Akshay Krishnamurthy: Which is a very, very common problem in reinforcement learning. The other thing that we’re doing in Homer, which is, I think, empirically very important is, in PCID, we are using the latent state space that we build – we build like a dynamics model over the latent state space – and we do what they call model-based planning. And this is also prone to failure when your model is wrong. Okay. So, you’re trying to plan, in your model, to reach some state, but your model is wrong and so you just completely fail. And in Homer, we replaced this planning module with, like, a policy search module that operates again on the raw observations. So again here, we’re not using the decoders, we’re just saying we’ve collected a good data set that covers the whole world and we can use the data to train a machine learning model to do the things we want to do. So those are the two reasons that Homer is, like, an empirical and in some ways a theoretical improvement. But the other, more important point is that in the Homer project, we’ve identified a new structural notion that we think might be useful in the design of other algorithms. And this is also this first part of the title which is this kinematic state abstraction. It’s kind of some property of an environment where you might say two states of the world are very similar if the dynamics to and from those states are the same. You should be able to collapse these two things together because, if one policy gets to one, it also gets to the other. So, for the purpose of exploration, these two states are very coupled. And for the purpose of forward prediction, they’re also very coupled because they have the same dynamics going forward.
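The kinematic abstraction idea, collapse two states when the dynamics to and from them agree, can be illustrated with a toy grouping over a known transition table. This is only an informal sketch of the intuition, not the paper's definition; real algorithms never see this table, and the function is invented for illustration:

```python
from collections import defaultdict

def kinematic_abstraction(T, states):
    """Group states of a toy MDP by their dynamics signature.

    T: dict mapping (state, action) -> {next_state: probability}.
    Two states share an abstract id when their forward dynamics
    (where each action leads) and backward dynamics (how likely each
    state-action pair is to land on them) are identical.
    """
    def forward_sig(s):
        return tuple(sorted((a, tuple(sorted(dist.items())))
                            for (s1, a), dist in T.items() if s1 == s))

    def backward_sig(s):
        return tuple(sorted(((s1, a), dist.get(s, 0.0))
                            for (s1, a), dist in T.items()))

    groups = defaultdict(list)
    for s in sorted(states):
        groups[(forward_sig(s), backward_sig(s))].append(s)
    return sorted(tuple(g) for g in groups.values())

T = {
    (0, 'a'): {1: 0.5, 2: 0.5},   # from 0, land on 1 or 2 uniformly
    (1, 'a'): {0: 1.0},
    (2, 'a'): {0: 1.0},           # states 1 and 2 behave identically
}
groups = kinematic_abstraction(T, states={0, 1, 2})
```

States 1 and 2 collapse into one abstract state: any policy that reaches one reaches the other with the same probability, and both predict the same future, which is the coupling described above.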

Host: Yeah.

Akshay Krishnamurthy: And so, this kinematic state abstraction is a way to formalize this. And it’s kind of easy to show that, in Block MDPs, all the observations coming from a latent state are tied together. They induce the same forward and backward dynamics. This idea also inspires this new supervised learning auxiliary supervision problem that we use here. So, this prediction problem is what they call contrastive learning. So, what you’re going to try and do is you’re going to splice data sequences together in some real and fake way, and you’re going to try to predict which ones are real and which ones are fake. So, I’m the agent. I see some observation. I take an action. I see some next observation. That’s a real transition. So, I have an observation-action-observation triple. Now, I’m at some new place in the world, a new observation, I take a different action. I see another observation. Let me swap the last two observations between these two triples. Those are fake transitions, right? And now, I can train a model to predict what’s real and fake from this crazy data set that I just like made up, right? I know the labels because I’m doing the splicing. So, I train a supervised learning model, just a classifier, to predict what’s real and what’s fake. If you think about it, you can kind of just pick out the transition operator because you know what a real transition is now. And that means you know what the, like, distribution of the next state is going to be given your current state and action.
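The real-versus-fake splicing walked through above can be written down directly. This is a sketch of the contrastive data construction only, not Homer itself; the function name and the toy observations are invented:

```python
import random

def contrastive_pairs(transitions, seed=0):
    """Build contrastive learning data from real transitions.

    transitions: list of real (obs, action, next_obs) triples.
    Returns a labelled dataset: each real triple gets label 1, and a
    'fake' triple -- the same (obs, action) spliced with the next_obs
    of a different transition -- gets label 0. The labels are free
    because we did the splicing ourselves.
    """
    rng = random.Random(seed)
    data = []
    for i, (o, a, o2) in enumerate(transitions):
        data.append(((o, a, o2), 1))                  # real transition
        j = rng.choice([k for k in range(len(transitions)) if k != i])
        fake_next = transitions[j][2]                 # splice in another next_obs
        data.append(((o, a, fake_next), 0))           # fake transition
    return data

transitions = [('x0', 'L', 'x1'), ('x2', 'R', 'x3')]
data = contrastive_pairs(transitions)
```

A classifier trained on `data` is exactly the supervised model described: to tell real from fake it has to learn which next observations are actually likely after each observation-action pair, which is the transition operator in disguise.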

Host: Yeah. 

Akshay Krishnamurthy: And again, we can show that somehow, when you solve this problem, you kind of are decoding the latent state. And then, like, sort of a lot of the stuff that we do with PCID goes through again. On the empirical side, it’s still a relatively toy experiment. But we’re more serious about comparing with the algorithms that people are running in practice. There was one algorithm that was pretty good. So, this is one of these policy search methods where you attach an exploration module based on this idea called random network distillation. So, the environment is this chain of like one hundred actions, and this thing got to like level twenty-five or thirty or something, but Homer just crushes the whole thing because it’s kind of designed to do so. I think it’s, in some ways, demonstrating what is a very difficult exploration problem.

Host: Yeah. 

Akshay Krishnamurthy: The algorithms that people typically use do not solve those hard exploration problems. And Homer is sort of constructed to do so and does it.

(music plays)

Host: Let’s talk briefly about the section in every academic paper that addresses research limitations and what scientists are still working on. In general, what big challenges remain in this process or pipeline of algorithmic perfection, and how are you moving into unlit spaces to resolve them?

Akshay Krishnamurthy: Let’s think about this on the theory side and the empirical side. So, on the theory side, I think it’s becoming important to identify more expressive models that emphasize the need for these core capabilities of generalization, exploration and credit assignment. To date, we’ve been working on the Block MDP model which certainly does these things, but it’s quite limited in many ways. So, a lot of problems tend to have like a continuous but low-dimensional latent state, and this is not captured by the Block MDP. But we’re also thinking about, like, somehow, substantially more expressive models. So, we have some new proposal for a model that’s like exponentially more expressive than the Block MDP.

Host: Okay.

Akshay Krishnamurthy: And whether we can build algorithms for those things. So like Homer and PCID really are leveraging this discrete latent state space in some very, very strong way and, because they’re so attached to the model, they don’t really work in the real world. Like your real world does not have that Block MDP structure and then what happens, right? But the other frustration here is that we kind of know that some modelling assumptions are required for provable sample and computational efficiency, and so the question is, what are the assumptions that are relevant to applications? And perhaps we should just keep expanding until we cover most things, but it might not actually work in that way. Like, maybe we’ll have to sort of fragment and say okay, for this type of application, we’re going to focus on this set of assumptions which are relevant and come up with a specialized set of algorithms and so on. So that’s kind of the theory side. I think we are sort of pushing on, like, more expressive models and algorithms that address more general settings. On the empirical side, Homer and PCID are very much algorithms designed by theoreticians, and there’s a question of whether we can make them more empirically effective. I think there are problems that do have something close to Block MDP structure and if we can make these algorithms somehow more natural, more robust, more practically viable, maybe we can start pushing towards those kinds of problems. And so, we have an ongoing effort around this where we’re trying to identify some problems that might have the kind of structure that Homer is sort of exploiting and also trying to make Homer somehow a little more data-efficient on these kinds of problems.

Host: It’s about now that I always ask what keeps you up at night, both figuratively and literally. Reinforcement learning is a bit of a rock star when it’s in a controlled or simulated environment, but once it hits the open world, the state space gets real, as we might say. So, what could possibly go wrong, Akshay, and how are you working, from the beginning, to ensure that it doesn’t?

Akshay Krishnamurthy: Well, to be honest, what usually keeps me up at night is some, like, bug in a proof or something in my code that’s not working and this has been especially true in the recent months because I’ve been, like, working very hard on this fairly delicate proof and keep getting stuck like every day. So mostly I can’t sleep because I, like, cannot solve my, you know, technical problem. On the higher-level, in terms of, I think, what could go wrong, I’m definitely aware that when you pursue these sort of blue sky efforts, they’re necessarily pretty high-risk and it could be the case that, when you take this theory-first approach, or even when you are thinking about reinforcement learning in the way that we were thinking, it just might not pan out. So, it just could be the case that we push on this for you know, five years, ten years, and then we just hit a dead end. And that would kind of be depressing, but I think it’s part of what happens when you do this kind of high-risk stuff.

Host: Yeah.

Akshay Krishnamurthy: But I’m also a bit reassured that, you know, RL is a very, very popular topic. We have a large community of incredibly brilliant and amazing people that are working on these problems, and everyone is tackling them with a different approach and different perspective. So even if what we’re trying to do doesn’t really pan out or doesn’t do the things that we want, I think we will find something that makes sense as a community. And from like a scientific perspective, that sort of I think keeps me sane. And I always have conversations with people about this kind of stuff. Like, when I give a talk about our work, they’re like, well, do you really think this is going to pan out? And I say, I don’t know, but like someone should try it, right? So, I think there’s a real risk that maybe these approaches don’t make sense. And obviously we’re thinking about when they do make sense and trying to make them make sense, but maybe they will not work, right? And that kind of keeps me up at night a little bit, but usually it’s the more technical stuff. The other concern, which is sort of what you are talking about, is, when we deploy these algorithms in the real world, what could go wrong? I certainly think a lot of things could go wrong and there’s a lot of things to be concerned about, like, you know, there’s biases, there’s safety issues, there’s fairness issues. And there’s now, like, a very vibrant community of people working on those kinds of problems. I think we’re so far from applications… like I don’t encourage anyone to deploy Homer anywhere! But I think I would say that, you know, MSR has brilliant people working on these kind of issues.
And I think if and when we do get closer to deploying something, I am super fortunate to feel like I can go talk to them and like loop them in and see like what should we be doing and – I don’t claim to be an expert on those things at all, so I would just want to take advice from other people on these kind of things.

Host: Well, it’s story time here on the Microsoft Research podcast. So, let’s hear yours. What got you started in high tech, what was your academic path toward research and how did you land in New York City (or at least post-COVID, the home office in Brooklyn) working on reinforcement learning for Microsoft Research?

Akshay Krishnamurthy: Yeah, so I grew up in a pretty technical family. I was born and raised in Silicon Valley. Both of my parents were engineers. My brother is a computer scientist, so I think it was kind of in the cards. That said, I didn’t think I wanted to be an academic probably through most of grad school. So, when I went to college, I kind of had more of an entrepreneurial bent. I think this is super natural for people in Silicon Valley. They all kind of want to do that. You hear about all these tech stars and so on. So, I didn’t really think I wanted to be an academic. I was, you know, hacking on random start-uppy project ideas in my dorm room and stuff. I did some industry internships at tech start-ups and these kind of things. And I actually didn’t really like it. So kind of what happened is, I did those things and I didn’t enjoy it and then I did a research project in undergrad and then I also had this amazing opportunity to spend a summer in Tel Aviv doing a research project with a professor there. And that was like super fun and really kind of pushed me towards going to grad school, and then my last summer of grad school, I was like super fortunate that Alekh and Miro like invited me to do an internship. And they were just, hey, do you want to come spend the summer in New York doing an internship? And I was like yeah, that sounds amazing and, so I did that. I really, like, fell in love with the lab. And since then like I think John has really been an advocate for me, so he…

Host: Langford?

Akshay Krishnamurthy: John Langford, yeah. So, he like encouraged me to apply for a postdoc. He was on my thesis committee. He wrote a letter for me. So, I came back for a postdoc and then I spent two years as an assistant professor at U Mass and then he sort of encouraged me to come back as a researcher. I had to do this advertisement at a conference about MSR and I think the thing that is real is like I’ve been there for three different stages of my life and I keep coming back. So, it is like really a glorious place to be to do research and I love it a lot. So, I think having these people like Alekh Agarwal, Miro Dudik, John Langford, they have been super inspirational. They’re really like role models for me and have really like given me an opportunity to you know, be here and succeed and so on.

Host: What’s one interesting thing we don’t know about you? I leave it open-ended so it can really be anything. Maybe it’s a personal characteristic or life event that impacted your career or life. Maybe it’s a side quest or a hobby that defines you outside your scientific work or maybe it’s just something fun or funny that people might not suspect about you?

Akshay Krishnamurthy: Yeah, so I was actually discussing this question last night with my partner, and the conclusion is that I’m pretty cliché for a, like, a tech bro. So, I do all the things that like, you know, the San Francisco start-up people do. I like to rock climb, I play ultimate Frisbee, I read a ton of sci-fi. I am re-reading The Three Body Problem right now which is just an excellent book, I cannot recommend it enough. I’m also kind of a foodie. So, in the pre-COVID times, Sid Sen and I would frequently sneak out of the office to grab, you know, fancy sweets, and being in New York, there’s just like tons of amazing restaurants and dessert places to go to. So, when I go to conferences, I almost always bring my climbing equipment. And there’s starting to be like a group of people that are in this crew of machine learning climbers and we typically go check out the local bouldering gyms or maybe even go to like an outdoor spot for climbing. And there’s also, obviously, a bunch of foodies in the ML community, so we have a small group of people who go check out, like, you know, Michelin star restaurants and stuff. So, like the academic lifestyle is kind of great for those kind of things.

Host: What’s the biggest thing you’ve climbed?

Akshay Krishnamurthy: So, I mostly boulder, so the biggest thing I’ve climbed is not very tall. And mostly I actually do, like, indoor bouldering, so, you know, these climbing gyms are cropping up everywhere.

Host: Right.

Akshay Krishnamurthy: At NeurIPS last year, a friend and I went to… so this was in Vancouver, there’s this, like, you know, world-class bouldering area just north in Squamish.

Host: Yes.

Akshay Krishnamurthy: And so, we went bouldering in Squamish. It was raining a bit so the rocks were kind of wet…

Host: Slippery.

Akshay Krishnamurthy: …which means we couldn’t do anything.

Host: No.

Akshay Krishnamurthy: I think I’m also not that good, so… The outdoors is really just when you kind of get humiliated.

Host: Yeah.

Akshay Krishnamurthy: Nature slams you. And I think the other thing is like the way they do this grading, the grading outdoors is not commensurate with the grading indoors. So, you kind of get humbled.

Host: They need an algorithm for that.

Akshay Krishnamurthy: Yeah.

Host: Well as we close, and this has been super fun, I’d like you to dream out loud about the future for a minute and give our listeners a vision for next gen reinforcement learning. I sometimes frame this now as, if you are wildly successful, what will you be known for at the end of your career, and how will the research you and your colleagues have done make an impact on the world, both to the scientific community and people at large?

Akshay Krishnamurthy: So, I really hope that we can be involved in the development of data-efficient, empirically effective reinforcement learning algorithms. I think it’s a little bit hard to predict where they will be effective. So, this is this lamppost sort of phenomenon that we were talking about the other day. I don’t know what all the applications might be for such algorithms, if we have them, and this is part of this constrained thinking where, like you know, we think about applications that our current technologies can solve and we don’t think about applications that maybe new technologies might enable. Looking under the lamppost is this story… or it’s like a metaphor… so there’s this person and they lose their keys. It’s the nighttime, they’re looking for their keys under the lamppost because that’s the only place where they can see, even though they are a hundred percent sure that is not where they lost their keys. It’s just where there’s light. So, this is this thing of, you know, we work on the things that we know we can do. And the applications that we think about are the things that are like just – just beyond our grasp. But maybe if we have some radically new technology, the set of applications will just really open up and we cannot really predict what those things might be. I think it would be really awesome to be a part of that effort and see what kind of things open up, but it would be really amazing to see that these algorithms do have like a positive impact on the world.

Host: Akshay Krishnamurthy, this has been more fun than a podcast about algorithms should be. Thank you so much for coming on!

Akshay Krishnamurthy: Thank you very much!

(music plays)

To learn more about Dr. Akshay Krishnamurthy, and the very latest in reinforcement learning, visit
