Diving into Deep InfoMax with Dr. Devon Hjelm

Published May 13, 2020

Episode 115 | May 13, 2020

Dr. Devon Hjelm is a senior researcher at the Microsoft Research lab in Montreal, and today, he joins me to dive deep into his research on Deep InfoMax, a novel self-supervised learning approach to training AI models – and getting good representations – without human annotation. He also tells us how an interest in neural networks, first human and then machine, led to an inspiring career in deep learning research.

Transcript

Devon Hjelm: The key thing that we walked away with, with Deep InfoMax, was that we don’t really care about estimating mutual information, we don’t care about the number that corresponds to how dependent things are, we just want a model that understands whether or not there’s more or less mutual information so that we can use that number as a learning signal to train the encoder.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: Dr. Devon Hjelm is a senior researcher at the Microsoft Research lab in Montreal, and today, he joins me to dive deep into his research on Deep InfoMax, a novel self-supervised learning approach to training AI models – and getting good representations – without human annotation. He also tells us how an interest in neural networks, first human and then machine, led to an inspiring career in deep learning research. That and much more on this episode of the Microsoft Research Podcast.

Host: Devon Hjelm, welcome to the podcast.

Devon Hjelm: Thank you. Glad to be here.

Host: So, you are a senior researcher who’s deep into deep learning at the MSR Lab in Montreal. So I’ve had several of your colleagues on the show over the last couple of years and we’ve talked about different flavors and different approaches to machine learning and learning machines, but today I want to hear your take on what you’re all up to there. What’s the big goal of your lab, and has it changed over the past couple of years at all, or grown more nuanced given new discoveries and advances in the research?

Devon Hjelm: Well, yeah, so, the lab is relatively new. It’s only been under Microsoft or MSR for like two or three years now, and the lab is also fairly diverse. It started from a background of like machine reading comprehension and language understanding, trying to build like tools based on language and knowledge graphs and stuff like that for people, to moving to Montreal and just basically becoming part of the ecosystem there. Incorporating more deep learning, incorporating things like fairness and FATE. Its mission is very much still focused on empowering people through research and compute and stuff like that.

Host: Right. So how would you define sort of the big audacious goal of the work you’re doing in Montreal?

Devon Hjelm: So, the team that I’m part of, we’re kind of like a deep learning camp, I guess. We’re the people who really focus on using these very large deep neural networks. And so, the core idea that we’re kind of really like focused on is how do we use these big things to help empower people, give them really interesting new, useful tools that improve their lives. Almost everything that we’re assuming here is that we’re going to be using deep learning or deep neural networks to do this because, over the last decade or so, we’ve seen tremendous, kind of like, explosion of utility on models that are based off of deep learning.

Host: Right.

Devon Hjelm: And we anticipate that to continue to be the case.

Host: Well, let’s talk more specifically about what you’re investigating personally and why you think it’s important. Give us the Virtual Earth 3D snapshot of the research interests you have and what they bring to the broader field of machine learning. What gets Devon Hjelm up in the morning?

Devon Hjelm: When you look at, when you are using like a large scale model to produce something useful for people in the world, you are kind of talking about, the model’s taking some data, usually complex and high-dimensional that’s coming from the real world, it’s transforming it in some way and then, from that transformation, it’s kind of producing utility. So, for one example, you can imagine like a self-driving car. It’s exposed to a camera video feed, and then from that camera video feed, it builds an understanding of, kind of like, all the different objects that it sees in its view. For instance, like different cars, different people and stuff like that. And then from there, it like makes a decision where to drive, so that it successfully navigates you down the road without like any catastrophic accidents. So, the intermediate step in-between that is like, what is the product of, sort of like, the processing of that big network that leads to the good performance? So in the case of a self-driving car, you need a visual system that’s able to identify what all the objects are, what they’re doing, what their velocities might be, and so I can make good decisions on whether or not I want to, you know, turn or go straight or slam on the brakes or something like that. I’m really interested in, sort of like, how do we arrive to those good, what we call, representations of the world from high-dimensional data.

Host: Well let’s rewind for a minute because where you’ve been has influenced where you are today. You did a postdoc under Yoshua Bengio who’s a bona fide Turing Award winner and one of the godfathers of deep learning. And he also was one of the founders – or the founder – of the Montreal Institute for Learning Algorithms, or MILA. And I know you are still collaborating with Dr. Bengio today. But talk a little bit about what you were working on during your postdoc days, and how that work has evolved and informed what you’re working on today.

Devon Hjelm: Yeah, so I’ve always been like extremely well-influenced by Yoshua and also the general camp on which he kind of is centered on, which is the whole deep learning camp. Yoshua has always been, sort of like, really strongly involved with generative models and representation learning and unsupervised learning. So, it was just, kind of like, a natural fit for me to do a postdoc over there. So, while I was there, I focused on generative adversarial networks, also called GANs, and this work, kind of like, naturally led into mutual information estimation because there’s a lot of parallels, or kind of similarities, between how a generative adversarial network learns to generate data and how you might estimate mutual information.

Host: Mmm-hmm.

Devon Hjelm: And then this ultimately led into stuff having to do with learning representations using mutual information estimation.

Host: All right. I want to go back a little bit because you mentioned a “camp,” and if I understand that, it’s like people getting together and saying this is how I believe, this is my worldview of deep learning, as opposed to another worldview of deep learning. So, can you kind of differentiate what the difference is there?

Devon Hjelm: I mean, ultimately, everybody is, sort of like, interested in this general problem space that I described initially which was like, how do you take complex real-world data and do useful things with it? How do you plan? How do you reason? Stuff like that. But a key component of this is how do you process, or how do you perceive, the world? Up until, sort of like, deep learning appeared, the fields weren’t having tremendous success on how to process like very, very large dimensional data that was coming from vision or natural language, and so when you look at the, kind of like, high level view of what it means to process complex data and to do useful things on it, different people focus on different parts of that. So, for instance, there’s a whole field of people who basically focus on features that are given from very complex neural networks and just figure out how to reason on top of those. But then there’s also people who believe that, you know, however we perceive the world, they should be packed into symbols that resemble formal logic and we need to be using these sorts of things if we really want to be talking about solving these really hard problems. And so, the deep learning camp kind of, sort of defaults to the idea well, we’re just going to throw the whole thing, end-to-end, at the problem, and train the whole thing end-to-end, and do back propagation, throwing as much data as possible at it. And you know, it’s worked really, really well, and it continues to sort of be one of the factors that drives us forward.

Host: Well I think it’s interesting because it does affect, you know, your choice in research on what direction you are going and how you are going to run at the hill.

Devon Hjelm: Yeah, and one of the consequences of having to do things end-to-end is it’s extremely expensive. So, it’s actually becoming more and more difficult. Back in the day, people were working with these static image datasets that were small, like 32×32 pixels, and this has slowly expanded and exploded to very, very large datasets. People are working with video and, with that, if you want to do things end-to-end, the compute cost goes up, it becomes more difficult to run like thousands of trials of similar models to see which ones work better, and it becomes like a thing that’s harder for like smaller researchers to do and more of a thing that’s done by the very biggest players.

Host: Well, we’ve been – and by we, I mean, you – have been working on AI for more than fifty years now…

Devon Hjelm: Right.

Host: …and there’s been some amazing progress in the field, especially in deep learning when it comes to performing easily definable and discrete tasks, but when it comes to performing tasks in complex real-world situations, we are, as you say, still very far from solving AI. So, in broad strokes – and I want you to stay kind of high-level here – what’s the big problem and what’s holding machines back IRL?

Devon Hjelm: So, I mean, the way that I see it, one of the biggest challenges that we’re facing right now, after, sort of like, the surge in deep learning, is generalization. So, this is the ability for a model, given that it’s been trained in a certain way, to perform well in a different situation. And this is really important because it’s either very difficult to impossible to collect all possible data that you would need that would resemble, sort of like, the test environment. So, for instance, you can imagine the self-driving car scenario. It’s very expensive for me to try to train a visual system under all possible road conditions, at all times of day, at all locations on earth. And these models, they do have a tendency, if you are not careful, to totally fail when you present them with new combinations of data such as that.

Host: Mmm-hmm.

Devon Hjelm: So, if I only train in, you know, northern California and I transfer to Quebec in the middle of winter, there’s things about those systems that will fail. And then, in addition to that, I mean, if we want these things to work with humans who are notoriously good at expressing unique, hard-to-model behavior, our models have to be pretty good at generalizing to that behavior to actually be useful. Otherwise it will only be useful to like a subset of the population, and that’s not what we really strive for.

Host: Well, one of the thorniest long-standing challenges to ML researchers is learning good representations without annotation. And this is part of the expense problem, right, is the labeling data and so on. So, what’s wrong, in your opinion, with the annotation model and the learning algorithms behind it and what kinds of learning algorithms do you think we need to take us into a new ML future?

Devon Hjelm: There’s a couple of different things I suppose. If you are going to train a model under, sort of like, the standard supervised setting, suppose I’m given, like I said before, like self-driving car data, and somebody annotates like the position and the class for every single object in the visual scene, and then I’m able to train a model to this end-to-end, you know, it could resolve to a pretty good representation that I might be able to do some planning on. But that annotation alone is very difficult to do. But on top of that, it’s difficult to say whether or not any particular annotation is useful for a general task. You know, some scene that’s happening, some video scene or something like that, and you are trying to describe what’s going on, how would a human describe what’s going on? It might only capture a fraction of what really is going on and a model trained in that way might only be useful for certain tasks and not for other tasks.

(music plays)

Host: When we’re talking about learning good representations, that’s sort of one of the nuggets of what you’re after, right? So, let’s get specific and talk about how you’re taking a run at this good representation hill. Last year you presented a paper at ICLR that outlines an approach you call Deep InfoMax. So, tell us about Deep InfoMax, and start kind of “writ large.” What is it and what are the learning principles on which it’s based? We’ll get real specific and technical in a second here, but give us the big picture.

Devon Hjelm: Sure, sure. So, at the high level, it’s a type of model that learns representations in an unsupervised way, that is without labels that a human needs to define ahead of time. And it’s also, I guess, what’s being called a self-supervised model. And so this is a model that kind of, instead of tasks being designed by a human, in the sense that the labels or targets that it’s trying to predict are coming from, you know, something like the class cat or dog, it generates its own labels by basically playing around with the statistics of the data. There’s two, kind of like, core themes behind Deep InfoMax. So one is, you are given a bunch of data that has some structure, like there’s patches. I can extract patches from images. I can present the model whole or parts of the image and I can basically ask the model, can you tell whether or not these things go together or not?

Host: Yeah.

Devon Hjelm: And it’s just basically a two-way game, just yes or no.

Host: Okay.

Devon Hjelm: Does it go together? So, this is like one part of it. And then the other part is like the actual function that you use to train this thing. So, there has to be, sort of like, a number that the model outputs that tells you how well it’s doing, and this thing is the mutual information estimation maximization thing.

Host: Okay.

Devon Hjelm: When you present a model, say two different sets of pairs, and you ask it to differentiate between these two, this effectively is forcing the model to learn something about the mutual information about the things that go together, because you are encouraging it to understand the dependencies or the relationships of the things that go together. So, for instance, if I give you a bunch of different pictures of patches of the same cat, you have to understand a little bit of the structure of a cat. And so, these things are dependent or related in the sense that they all eventually compose the same thing, a cat.

Host: Mmm-hmm.

Devon Hjelm: But if I present other things and I just basically say, you should be able to tell that these things go together, as opposed to like say, patches that come from cats and dogs, it forces the model to learn that these things are related.

Host: Let’s unpack Deep InfoMax technically on several levels. And start with a critical question that I think you borrowed from Sesame Street, can you tell which thing is not like the other? So, you’re addressing this with the question, does this pair belong together? How do you do that by both borrowing from and diverging from the technical approaches in the Mutual Information Neural Estimator, as you call it, or MINE?

Devon Hjelm: So, MINE, at its core, is meant to estimate mutual information. And so mutual information is this quantity that expresses how related two different random variables are, or how related two different sets of random variables are.

Host: Right.

Devon Hjelm: And so, it’s an extremely important quantity because being able to tell how related things are can help with all sorts of things like prediction, all sorts of other, sort of, important downstream tasks. But it’s also a very notoriously difficult quantity to estimate. So if you have very high-dimensional data that’s continuous, say like images or language, there traditionally hasn’t been any straightforward way to estimate it because you have to do this, sort of like, infinite integral over distributions that you don’t necessarily know. This is where neural networks come in. Neural networks, and in particular GANs, have this ability to estimate log ratios of probabilities…

Host: Mmm-hmm.

Devon Hjelm: …without actually needing to know what the structure of that distribution is. And what I mean by that structure of that distribution is like, we don’t know if it’s Gaussian or, you know, Poisson distributed or whatever. But GANs, they estimate these log ratios and then they use them to train their generator function.
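
The log-ratio idea can be checked on a toy case where the answer is known in closed form. The sketch below is only an illustration, not anything from the GAN or MINE code: it assumes two one-dimensional Gaussians with unit variance and means +1 and -1 (an arbitrary choice), and the helper names are made up for this example. For that pair, the logit of the ideal two-way classifier with equal class priors equals log p(x) - log q(x), which works out to 2x.

```python
import numpy as np

def gaussian_pdf(x, mean, std=1.0):
    """Density of a normal distribution, written out so nothing beyond NumPy is needed."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def ideal_classifier_logit(x, mean_p=1.0, mean_q=-1.0):
    """Logit of the ideal classifier that separates samples of p from samples of q.

    Assumes equal class priors; the two means are arbitrary toy values.
    """
    p = gaussian_pdf(x, mean_p)
    q = gaussian_pdf(x, mean_q)
    prob_p = p / (p + q)  # probability that x was drawn from p
    return np.log(prob_p) - np.log1p(-prob_p)

xs = np.linspace(-3.0, 3.0, 7)
logits = ideal_classifier_logit(xs)
# Closed-form log ratio for these two Gaussians: log p(x) - log q(x) = 2x
log_ratio = np.log(gaussian_pdf(xs, 1.0)) - np.log(gaussian_pdf(xs, -1.0))
```

A trained network only approximates this ideal classifier, but the point carries over: its logit is an estimate of the log ratio, obtained from samples alone.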

Host: Mmm-hmm.

Devon Hjelm: So, if you look at the mutual information, it’s just like a divergence, it’s a difference, like a difference between two different distributions. One is the joint distribution between two variables, and one is the product of marginals. And so, the joint distribution is just basically the probability that these two things co-occur, and then the product of marginals is their probabilities that they occur independently of each other. And so, the way that GANs do this estimation is you just draw samples from two different distributions, and you train a discriminator. And a discriminator is just a classifier. So, you just present samples from one distribution, present samples from the other distribution, and you ask, does it belong to one distribution or the other distribution?
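
Written out, the divergence described here is the standard definition of mutual information as a Kullback-Leibler divergence between the joint distribution and the product of marginals. The symbols X, Y and p are notation chosen for this note, not taken from the conversation.

```latex
I(X; Y)
  = D_{\mathrm{KL}}\!\left( p(x, y) \,\middle\|\, p(x)\, p(y) \right)
  = \mathbb{E}_{p(x, y)}\!\left[ \log \frac{p(x, y)}{p(x)\, p(y)} \right]
```

The quantity inside the expectation is a log ratio of two densities, which is the kind of quantity a classifier between the two sets of samples can estimate.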

Host: Okay.

Devon Hjelm: So, this is, sort of like, the technical thing. So like, if for instance, if I’m trying to train a model that’s able to distinguish between, you know, cats and dogs, I present it with cats and I tell it hey, this is label zero, I present it with dogs, this is label one. And at the end of the day, if you train like a standard deep network classifier, it’s learning to estimate the log ratio of the probability of the cat or the probability of the dog. So mutual information estimator, what it essentially comes down to is it’s training a classifier between samples that go together and don’t go together. So Deep InfoMax is very much based on our work on Mutual Information Neural Estimator, or MINE. What Deep InfoMax basically does is, it takes like a full image, and it presents it through a deep neural network. And when it gets processed through this deep neural network, if you look at different layers of this network, this is a convolutional neural network so different locations of this convolutional neural network have been processed by different patches of the image. So, you can think of these like features at these different levels, at these different locations, being part of the input, right? So, what Deep InfoMax basically does is it says, well, all those features basically go together, so I’m going to group them all together and present them to a classifier and say, well, tell me that these go together, classify it as zero or one, whatever you call together, and then take combinations of those patch representations with images that came from somewhere else, put those together and say, these don’t go together.

Host: Okay.

Devon Hjelm: And so that process is actually very similar to what we did in mutual information neural estimation, is the things that go together, these are really like samples from the joint distribution. The things that don’t go together, well this resembles something like samples from the product of marginals. And so, when you train a classifier to distinguish between these two, you’re training the model in a similar way that you are in MINE to interpret the dependencies between all the things that go together, that make them go together, like why do they go together? And that’s, sort of like, encoded into the idea about the joint distribution. So, when you do that, you really are estimating something like the mutual information. But the key thing that we walked away with Deep InfoMax was that we don’t really care about estimating mutual information, we don’t care about the number that corresponds to how dependent things are, we just want a model that understands whether or not there’s more or less mutual information so that we can use that number as a learning signal to train the encoder.
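
The pairing game can be sketched in a few lines. This is a rough illustration under stated assumptions, not the Deep InfoMax implementation: the features are random toy arrays rather than outputs of a convolutional encoder, the score is a plain dot product rather than a learned discriminator, negatives are formed by shifting the batch by one, and the function names are hypothetical.

```python
import numpy as np

def pair_scores(local_feats, global_feats):
    """Dot-product score of every local feature against one global vector per image.

    local_feats: (batch, locations, dim), global_feats: (batch, dim).
    """
    return np.einsum("bld,bd->bl", local_feats, global_feats)

def go_together_loss(local_feats, global_feats):
    """Binary cross-entropy on logits: 'together' pairs vs. mismatched pairs."""
    # Pairs from the same image stand in for samples from the joint distribution.
    positive = pair_scores(local_feats, global_feats)
    # Pairing with another image's global vector stands in for the product of marginals.
    negative = pair_scores(local_feats, np.roll(global_feats, 1, axis=0))
    # softplus(-s) penalizes low scores on positives, softplus(s) high scores on negatives.
    return np.logaddexp(0.0, -positive).mean() + np.logaddexp(0.0, negative).mean()

rng = np.random.default_rng(0)
batch, locations, dim = 256, 4, 16
global_feats = rng.normal(size=(batch, dim))
# Toy "good encoder": local features agree with their own image's global vector.
aligned_locals = global_feats[:, None, :] + 0.1 * rng.normal(size=(batch, locations, dim))
# Toy "uninformative encoder": local features unrelated to the global vector.
random_locals = rng.normal(size=(batch, locations, dim))

aligned_loss = go_together_loss(aligned_locals, global_feats)
random_loss = go_together_loss(random_locals, global_feats)
```

In this toy setup the loss is lower when local and global features of the same image agree, which is the signal that would be backpropagated into an encoder during training.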

Host: Well, let’s talk a bit more about some of the problems that arise when you aim for, as you call it, “pure mutual information maximization.” You’ve said in the past, that’s not actually what we’re aiming for here. So, what do you do with the issues of noisy information, and what do you really want to aim for here?

Devon Hjelm: I guess there’s like two different ways to answer that. So one is that, I mean, when we do this Deep InfoMax-style learning on images, while it does resemble something like mutual information maximization, there’s a caveat in the sense that MINE is only an estimator. It’s sort of like a lower bound to the mutual information. It can only learn the number of dependencies that it’s capable of based on the capacity of the neural network and the number of samples from the world that it’s received. So, the lower capacity of the model, the less stuff it will be able to learn. The less samples it’s exposed to, the less stuff it will be able to learn.
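
The lower-bound behavior can be seen on a case where the true mutual information is known. The sketch below assumes the Donsker-Varadhan form of the bound, which is the one the MINE paper builds on, and applies it to two correlated Gaussians. The critic here is a fixed, hand-picked function (0.3 times the product of the inputs) standing in for a trained network, so the names and constants are illustrative only.

```python
import numpy as np

def dv_lower_bound(x, y, critic, rng):
    """Donsker-Varadhan style estimate: E_joint[T] - log E_marginals[exp(T)]."""
    joint_term = critic(x, y).mean()
    # Shuffling y breaks the pairing, approximating samples from the product of marginals.
    y_shuffled = rng.permutation(y)
    marginal_term = np.log(np.exp(critic(x, y_shuffled)).mean())
    return joint_term - marginal_term

rng = np.random.default_rng(0)
n, rho = 20000, 0.8
x = rng.normal(size=n)
y = rho * x + np.sqrt(1.0 - rho**2) * rng.normal(size=n)  # unit variance, correlation rho

def weak_critic(a, b):
    # Hypothetical low-capacity critic; a trained network would replace this.
    return 0.3 * a * b

bound = dv_lower_bound(x, y, weak_critic, rng)
true_mi = -0.5 * np.log(1.0 - rho**2)  # closed form for a bivariate Gaussian
```

With this weak critic the estimate is positive but well below the true value, which matches the point about capacity: a more expressive, trained critic would tighten the bound, and fewer samples would make the estimate noisier.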

Host: Mmm-hmm.

Devon Hjelm: But it has to learn something, so Deep InfoMax is based on this, sort of like, structural thing where you patch things in different ways. This kind of biases the model to learn things that are expressible, structurally. So for instance, because I’m effectively doing a comparison kind of game between different patches of the same image, it needs to understand why those patches are related and it maybe doesn’t need to understand something more nuanced like the texture of one of the patches compared to the other, and the reason why it doesn’t need to understand that is because it’s maybe not learnable. It’s a much harder problem than just understanding whether or not, like, the shape goes together, or the general color goes together, or something like that. So, the model will focus on those things that are easy to pick up. And a lot of times, how we design these tasks, this way of breaking up the data in a particular way, when we apply it to mutual information neural estimation-style learning, or Deep InfoMax, matters more than the actual objective that we use.

Host: So, is there anything different, or anything sort of noteworthy, about any of the technical aspects of Deep InfoMax that sort of says hey, stand up and take notice? This is a new approach to solving some of these problems?

Devon Hjelm: So, the main nuanced thing about the Deep InfoMax model was that, so as I mentioned before, we were taking these local representations, these features that corresponded to a patch of the input.

Host: Mmm-hmm.

Devon Hjelm: And it’s important, in Deep InfoMax, if you do things that way, that those features actually correspond to a patch. What’s interesting about a lot of the convolutional models that are used in the wild, like the very popular ones like ResNet, while they have a spatial extent as you progress through the network, very, very quickly, those locations cover the whole input.

Host: Mmm.

Devon Hjelm: So even though it has a spatial extent, there are locations that are spatial in the neural network, but if you back-project and look at the stuff that it’s processed, it’s actually processing the whole input. So that’s not actually a different view of the input anymore. Um… It’s good and bad in the sense that it’s nice that, at some point, the architectures are mixing everything it can from the input to try to infer whether or not something belongs to a class, in the case of supervised learning. But in the case of self-supervised learning, if you want to really leverage these locations from the natural architecture, then you need to be a little bit more careful about how you apply architectures. So, if you have these architectures that quickly expand over like full receptive field of the input, then you run into trouble. And so Deep InfoMax is kind of particular in that it really tried to leverage the internal structure of the model over just say like pure messing with the input data and then designing losses on top of it.
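
How quickly a location stops being a patch and starts covering the whole input can be worked out with the standard receptive-field recurrence for stacked convolutions. The sketch below is a generic calculation along one spatial axis, ignoring padding and dilation; the two layer stacks are made-up examples, not the architecture of ResNet or of Deep InfoMax.

```python
def receptive_field_sizes(layers):
    """Receptive field (in input pixels, one axis) after each layer.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    size, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes

# Three 3x3 convolutions with stride 1: each location still sees a small patch.
shallow = receptive_field_sizes([(3, 1)] * 3)
# Five 3x3 convolutions with stride 2: the receptive field roughly doubles per layer.
strided = receptive_field_sizes([(3, 2)] * 5)
```

For a 32×32 input, the strided stack passes the full image width by its fifth layer, so those locations no longer correspond to distinct patches. That is the situation described above, where local features stop being different views of the input.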

(music plays)

Host: At this point in the podcast, Devon, I always ask my guests what could possibly go wrong, so I’ll ask you, too. And I do this because I want to address some of the elephants in the room where, you know, this is a powerful technology. Is there anything about your work that keeps you up at night, metaphorically, and if so, how are you addressing it?

Devon Hjelm: I think we all have to, kind of like, think about this, whether our work is like more directly related to things like fairness and privacy and nefarious agents, or not. So, some people, particularly on like the FATE team, focus on more, like sort of, the reasoning aspects of our models. Like how do they take data presented to them and produce results that are fair or retain privacy and stuff like that? But, kind of like, the general trend that we’re seeing is that we’re using, for our reasoning, we’re using more and more deep models that produce features that people use to do that reasoning on top of. And so, I’m very much interested in how the quality of those features of that model impact these, I don’t know what you call them, moral metrics. So, for instance, is it possible that my model, if it’s presented with a face of a person, also encodes their identity perfectly or something like that? That my features either do or don’t make it easy for someone to infer where those features came from and their identities…

Host: Right.

Devon Hjelm: …that they might want to keep hidden. So, yeah, I mean, like in particular when you’re talking about like the mutual information stuff, like we’re working really hard on maximizing mutual information all over the place, so we’re just trying to capture as much information about the data as possible. But you could also imagine using the same techniques to minimize mutual information. So, you just basically like flip the sign on some of the objective functions and say okay, there’s these properties that I really don’t want in this representation. Whatever you do, whatever this representation looks like, minimize it. And you can use the exact same objective functions to try to do the exact same thing. It’s a little bit trickier because you are dealing with this min/max, but you can imagine doing stuff like that. So, it’s like another sub-part of that whole question about what a good representation is, is sometimes you don’t really want all the underlying things in that representation.

Host: Right.

Devon Hjelm: You maybe want to actually hide things.

Host: Yeah, yeah, yeah. I want to drill in just a little bit on how you can control, at the outset, keeping a lid on the things that you could see going wrong with machines that act in the world like humans do and pass the Turing test on a grand scale.

Devon Hjelm: So when we talk about whether or not a representation is good or not, or useful, which I guess is like, sort of like, the core of what I’m focused on, it’s important that, among the collection of things that we use to evaluate our models, we keep in mind metrics that evaluate things like fairness and privacy. So, one thing that we’re seeing, as we progress, in representation learning is that it’s not just like one metric that really matters as far as like whether or not the representation is going to be good for deployment on some complicated downstream task. It’s not going to be just classification, there’s like a suite of things that we have to evaluate models on and the suite of things, kind of like, provides a better story, like a fine grained story, as far as like whether or not this representation truly is going to be useful. And one of those dimensions of usefulness is things like privacy and fairness.

Host: Hmmm. Speaking of stories… Tell us about yourself, Devon, and your path to machine learning research at MSR, and how’s the ML game better since you joined the team?

Devon Hjelm: I mean I guess I’ve always been pretty interested in representation learning, like understanding representations of the world, like deriving them. So, I started in physics, which is about learning about representations of the world which have to do with like dynamics and all sorts of quantities, physical quantities. And then I got interested in languages, because I was interested in how people represent the world through their language, through words and their relationships and stuff like that. And so, I went from physics to linguistics and did a stint there. So, I quickly realized that like, at least where I was at, they didn’t have the tools necessary to really solve the types of problems that I was interested in which was like understanding, from language, how humans represent the world. So, I started getting really interested in using, you know, computers to help solve these problems, like models and stuff like that. So, I got involved with some people over at the CS department, and then joined the PhD program in computer science. At University of New Mexico, probably like one of the stronger, sort of like, groups that were focusing on like modeling complex data and learning representations, were on the neuroimaging side. So, there was like a big research institute called the Mind Research Network. So, I talked to Vince Calhoun, who’s still chief officer over there, and I said, hey, I’m interested in these deep neural networks. I think they might be useful for neuroscience stuff. They were looking at like, sort of like, brain imaging data like fMRI, EEG, and some other related datasets and modes. And he said oh, okay well, here is some public data that we have available, try your model on it and see how it works. And I did it, and it produced something interesting, and then he said okay, well, I’ll take you on as a grad student.
So, I lived over there for a couple of years, and I was kind of like the black sheep who was using the deep neural networks while everybody else was using these more linear models like ICA and PCA and stuff like that. So, me and my, sort of like, unofficial advisor Sergey Plis were kind of like the deep learning nerds and we put out some nice papers that used deep learning and showed that it worked with fMRI data and… but through that whole process, because Mind Research Network was such a good like research institution with good connections and grant money and stuff like that, they were able to connect our small group to a bunch of really big names in deep learning like Russ Salakhutdinov and Kyunghyun Cho. Russ was, at the time, a professor at University of Toronto. He’s since moved on to CMU, and Kyunghyun Cho was a postdoc at the time with Yoshua. You know, this entire time I’m like pushing on the whole representation learning stuff, but in the context of neuroimaging, learning how to use new models, even my stuff with generative models, it was all about learning good representations because you can use the intermediate states for generative models as representations as well.

Host: Mmm-hmm.

Devon Hjelm: And just through those connections, I was able to get, like, more deep learning papers into NeurIPS, and through that connection I was able to like, reach out to Yoshua at the end of my PhD and was able to connect up with them there.

Host: And so, you were at MILA for a while, the Montreal Institute for Learning Algorithms.

Devon Hjelm: Yeah. I’m still an adjunct professor there.

Host: Okay.

Devon Hjelm: So, I still co-supervise students and help them with research.

Host: What’s one thing we don’t know about you, Devon? Something interesting that may have impacted your career, sort of personal. Maybe it’s a side quest or a personality trait or something interesting to sort of give us context about who you are a little deeper in?

Devon Hjelm: Well, I played bass, you know, more or less professionally, for three years in grad school. I was playing…

Host: Were you in a band?

Devon Hjelm: Yeah, like, I’d have gigs like three, sometimes four days a week, playing salsa.

Host: Are you kidding?

Devon Hjelm: No. I’m not. So, yeah, we would play everywhere from like casinos to like dances to everything like that. I mean, I played music my entire life, but I was pretty ambitious about being very, very good and so that group was pretty cool because there were super high-skilled salsa musicians from like all over the world. One person was from like a touring group in Cuba. And then there was another guy who joined our group who was on some Grammy albums. And so, I got to do that for a little while. It was really good! So, like I was working as a musician on top of doing my PhD.

Host: Do you still play?

Devon Hjelm: I don’t play bass anymore because I got tired of spending all my Fridays and Saturdays playing the same music all the time and not being able to like sit and like enjoy watching someone else play. Right now, I’m learning how to play the mandolin because it’s easier to play by yourself!

Host: Let’s get some parting thoughts and shots in. As we close, I want to give you the chance to think ahead and dream about the future. Let’s say you’re wildly successful, Devon. What will the world look like at the end of your career? What will you have accomplished in your field, and what will we be able to do that we hadn’t been able to do before?

Devon Hjelm: I can imagine all sorts of like things that we can do with models that we can’t do today. Like for instance, like present a model to a brand new environment that’s able to navigate and explore this environment on its own with very little help from human experimenters and learn everything it needs to learn to do useful stuff. I firmly believe that it’s like important that our ultimate goal for all of this AI effort is to arrive at models and algorithms and agents that are useful to human beings in the real world so that they can do things that they couldn’t do before more easily, sort of like, to empower the more general population. On top of that, if I was wildly successful, is continuing an ongoing exciting community of people working on really difficult problems because they are passionate about it. Everybody has, sort of like, their own ideas for, like, what would be useful or good for people. And as long as I’m part of that, someone who gets to interact with that community and help build and shape those things, I think that’s the best possible thing that I can hope for. And so, I’m just hoping that like I can be part of that community and it continues to thrive.

Host: Devon Hjelm, thank you for joining us today. It’s been really fun!

Devon Hjelm: Thank you.

(music plays)

To learn more about Dr. Devon Hjelm, and the very latest in deep learning research, visit Microsoft.com/research