Going meta: learning algorithms and the self-supervised machine with Dr. Philip Bachman

Published December 4, 2019


Dr. Philip Bachman on the Microsoft Research Podcast

Episode 101 | December 4, 2019

Deep learning methodologies like supervised learning have been very successful in training machines to make predictions about the world. But because they’re so dependent upon large amounts of human-annotated data, they’ve been difficult to scale. Dr. Phil Bachman, a researcher at MSR Montreal, would like to change that, and he’s working to train machines to collect, sort and label their own data, so people don’t have to.

Today, Dr. Bachman gives us an overview of the machine learning landscape and tells us why it’s been so difficult to sort through noise and get to useful information. He also talks about his ongoing work on Deep InfoMax, a novel approach to self-supervised learning, and reveals what a conversation about ML classification problems has to do with Harrison Ford’s face.


Transcript

Phil Bachman: Training a machine to look at a large amount of unannotated data and point to specific examples and say, well, I think if a human comes in and tells me exactly what that thing is, I’ll learn a lot about the problem that I’m trying to solve. So this general notion of carefully selecting which of those examples you want to spend the money or spend the time to get a human to go in and provide the annotations for those examples, that’s this idea of active learning.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: Deep learning methodologies like supervised learning have been very successful in training machines to make predictions about the world. But because they’re so dependent upon large amounts of human-annotated data, they’ve been difficult to scale. Dr. Phil Bachman, a researcher at MSR Montreal, would like to change that, and he’s working to train machines to collect, sort and label their own data, so people don’t have to.

Today, Dr. Bachman gives us an overview of the machine learning landscape and tells us why it’s been so difficult to sort through noise and get to useful information. He also talks about his ongoing work on Deep InfoMax, a novel approach to self-supervised learning, and reveals what a conversation about ML classification problems has to do with Harrison Ford’s face. That and much more on this episode of the Microsoft Research Podcast.

(music plays)

Host: Phil Bachman, welcome to the podcast!

Phil Bachman: Hi. Thanks for having me.

Host: So as a researcher at MSR Montreal, you’ve got a lot going on. Let’s start macro and then get micro. And we’ll start with a little phrase that I like in your bio that says you want to understand the ways in which actionable information can be distilled from raw data. Unpack it for us. What big problem or problems are you working on? What gets you up in the morning?

Phil Bachman: So I’d say the key here is to sort of understand the distinction between information in general and, let’s say, information that might be useful. So for example, if images are coming from the camera that you are using to pilot a self-driving car, then low-level sensor noise probably doesn’t provide you useful information…

Host: Hmm.

Phil Bachman: …for deciding whether to stop the car or whether to turn or make other sorts of decisions that are useful for driving. So, what I’m interested in, sort of this phrase, actionable information here, it’s referring specifically to trying to focus on getting our models to capture the information content in the data that we’re looking at that is actually going to be useful in the future for making some sorts of decisions. So if we’re training a model that’s processing the video data that’s being used to drive this car, then perhaps we don’t want to waste the effort of the model on trying to represent this low-level information about small variations in pixel intensity. And we’d rather have the model focus its capacity for representing information on the information that corresponds to sort of higher-level structure in the image, so things like the presence or absence of a pedestrian or another car in front of it. So that’s kind of what I mean with this phrase, actionable information. So this distillation from raw data is on doing learning from data that hasn’t been manually curated or that doesn’t have a lot of information injected into it by a human who’s doing the data collection process. So going back to the self-driving car example, I’d like to have a system where we could allow the computer just to watch thousands of hours of video that’s captured from a bunch of cars driving around. Then what I want to be able to do is have a system that’s just watching all of that video and doesn’t require that much input from a person who’s pointing to the video and saying specifically what’s going to be interesting or useful in the future. So this information that’s going to be useful for performing the types of tasks that we want our model to do eventually.

Host: Before we get specific, give us a short historical tour of the deep-learning methodologies as a level set, and then tell us why we need a methodology for learning representations from unlabeled data.

Phil Bachman: Okay. So in the context of machine learning, people often break it down into three categories. So there will be supervised learning, unsupervised learning and reinforcement learning…

Host: Mmm-hmm.

Phil Bachman: …and it’s not always clear what the distinctions between the methods are. But supervised learning is sort of what’s had the most immediate success and what’s driving a lot of the deep learning-powered technologies that are being used for doing things like speech recognition in phones or doing automated question answering for chat bots and stuff like that. So supervised learning refers to kind of a subset of the techniques that people apply when they have access to a large amount of data and they have a specific type of action that they want a model to perform when it processes that data. And what they do is, they get a person to go and label all the data and say, okay, well this is the input to the model at this point in time. And given this input, this is what the model should output. So you’re putting a lot of constraints on what the model is doing and constructing those constraints manually by having a person looking at a set of a million images and, for each image, they say, oh, this is a cat, this is a dog, this is a person, this is a car. So after having done that for thousands of hours, you now have a large data set where you have a bunch of different images and each of those images has an associated tag. And so now the kind of techniques that we work with and the optimization methods that we use for training our models, are very effective at fitting really large powerful models to large amounts of this sort of annotated data. So that’s kind of the traditional supervised learning. But the major downside there is that the process of providing all of those annotations can be very expensive. So that process of supervised learning has a lot of issues with scalability. What we’d like to do, ideally, is make use of a lot of that and figure out what kinds of information are actionable. So finding the information that seems like it will be useful for making decisions. So that’s getting into a contrast between supervised learning and unsupervised learning. 
And then there’s also reinforcement learning which is a slightly different set of techniques where you actually allow a model to go out and kind of perform experiments or try to do things and then somehow it receives feedback about the things that it’s doing that says, oh, what you just did, that was a good thing or that was a bad thing.

Host: Hmm.

Phil Bachman: And that it learns by kind of a process of trial and error. So that’s a general idea of reinforcement learning.

Host: Hmm. Okay. We mentioned two flavors of this, unsupervised and then self-supervised. Is that another differentiation there?

Phil Bachman: So the self-supervised learning, it’s not a completely different thing, but it’s a sort of subset of those types of techniques. So, the general idea behind self-supervised learning is that we try to design procedures that will generate little supervised learning problems for a model to solve, where the process of generating those little supervised learning problems is kind of automatic. And the hope here is that the kind of procedurally-generated supervised learning problems that our little algorithm is generating, based on the unlabeled data, will force the model to capture some useful information about the structure of that data that will allow it to answer more, sort of, human-oriented questions more easily in the future. So just to clarify this concept of procedurally generating supervised learning problems, one really simple example would be that you could try to train a model to have some understanding of the statistical structure of visual data by showing a model a bunch of images, but what you do is you take each image and you split it into a left half and a right half. So now what you do is you take your model, and all the model is allowed to see is the left half of the image…

Host: Hmm.

Phil Bachman: …and then you have another model that sort of tries to form a representation of the right half of the image. And so the model that looked at the left half of the image, you present it with representations of the right halves of, like, let’s say, ten thousand images, one of which is the image that it looked at. So, it’s kind of got like a partner that it’s looking for in this big bag of encoded right halves of images. And the job of the encoder that’s processing the left half of the image is to be able to look in that bag and pick out the right half that actually corresponds to the image that it originally came from. So in this case, we’re taking something that looks like unsupervised learning, but instead, here, what we’re doing, is treating it more like a supervised learning problem. So the model that looks at the left half of the image, its task is to solve something that looks like just a simple classification problem. And then making this like a one thousand–way classification problem.
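
The matching task described above can be sketched in a few lines. This is only a minimal illustration, assuming each half has already been encoded to a vector and that a dot product is used to score a left/right pair; the function name `matching_loss` and the toy random codes are hypothetical and are not taken from Dr. Bachman's implementation.

```python
import numpy as np

def matching_loss(left_codes, right_codes):
    """Cross-entropy of picking each left half's true right half out of the batch.

    left_codes, right_codes: arrays of shape (n, d); row i of each comes from image i.
    """
    scores = left_codes @ right_codes.T                    # (n, n) score for every left/right pair
    scores = scores - scores.max(axis=1, keepdims=True)    # stabilise the softmax
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                    # correct "class" for row i is column i

rng = np.random.default_rng(0)
n, d = 8, 64
right = rng.normal(size=(n, d))
aligned_left = right + 0.01 * rng.normal(size=(n, d))      # left codes that agree with their partners
random_left = rng.normal(size=(n, d))                      # left codes unrelated to their partners

loss_aligned = matching_loss(aligned_left, right)
loss_random = matching_loss(random_left, right)
print(loss_aligned, loss_random)
```

An encoder whose left-half codes line up with the matching right-half codes gets a much lower loss than one whose codes carry no information about the partner, which is the pressure the procedurally generated classification problem puts on the model.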

Host: The other thing that comes to my mind is, there’s this weird thing on the internet where like, Harrison Ford… you see half of his face and the other half of his face and they are completely different. Like if you put each halves together, they wouldn’t look like Harrison Ford, but together with the different halves, they look like him. So that would really trick the machine, I would think.

Phil Bachman: Actually, I wouldn’t be so confident about that!

Host: Really?

Phil Bachman: Yeah. The question that you’re sort of training the machine to answer is, which of these possible things do you think is most likely associated with the thing that you’re currently looking at? So unless there was somebody else’s right face half, that looked significantly more Harrison Ford-ish, than his own right face half, then the model actually could do pretty reasonably, I’d expect.

Host: That’s hilarious.

Phil Bachman: So unless you had somebody who… where it was like this really strict dichotomous separation of the halves of their face, like Two-Face from Batman or something…

Host: Right. That’s another one!

Phil Bachman: …in which case maybe the model would fail, but…

Host: I love that.

Phil Bachman: …if it’s within like standard realm of human variability, I think it would be okay.

Host: Well that’s good. So let’s move ahead to the algorithms that we’re talking about here. And you call them learning algorithms, and you’ve described your goal for learning algorithms in some intriguing ways. You want to train machines to go out and fetch data for themselves and actively find out about the world, and you want to get the machine to ask itself interesting questions so it begins to build up its own knowledge base. Tell us about these learning algorithms for active learning and what it takes to turn a machine into an information-seeking missile?

Phil Bachman: Yeah, so this kind of overall objective there that you’ve described is targeted at kind of expanding the scope of which parts of the problems that we’re currently trying to solve, are solved by the machine rather than by a person who is acting as a shepherd for the machine, or as a teacher or something along those lines. So right now, the machine learning component of most systems is a very important part of the system, but there’s a whole lot of human effort that surrounds the production and use of something like a practical image classifier or a practical machine translation system. So that’s one part of the effort that’s required for getting an automated system out there in the world. So part of the process is just the initial decision, like the thing that we want to do is machine translation, here’s a way of formalizing that problem and specifying it such that we can go out and now perform another part of the process – so this other part of the process is data collection.

Host: Hmm.

Phil Bachman: So you’d have to go out and you’d have to explicitly collect a lot of data that is relevant to the task that you are trying to solve. And then you have to take that data and you maybe have to have somebody curate it to make that data more directly useful or more immediately useful for the kinds of algorithms that we tend to use right now. So a lot of the work that I want to do is about trying to reduce the amount of human effort that’s required on those two fronts and trying to get as much of those two parts of the problem automated and built into the models that we’re training so that we don’t have to go out and manually annotate all the data.

Host: Talk to me about the technical end of that. You know, our listeners are pretty sophisticated and you are talking about algorithms that are training a machine to do something for itself. Go a little deeper there.

Phil Bachman: Okay. Yeah, I’ll kind of jump into the learning algorithms for active learning part, which I guess I actually completely skipped over as I was answering the question before. So training a machine to go out and collect its own data and point to specific examples and say, well, I think if a human comes in and tells me exactly what that thing is, I’ll learn a lot about the problem that I’m trying to solve. So this general notion of carefully selecting which of those examples you want to spend the money or spend the time to get a human to go in and provide the annotations for those examples, that’s this idea of active learning. So rather than just assuming that you have a huge batch of data and all the data is labeled, a lot of practical problems are structured more like, you have a lot of unlabeled data and you have to decide how to collect data and apply labels to it so that you can then train a model. So to do this efficiently, you take some of the data, you train a model, and then you look at what the model is doing and you try to figure out where it’s weak and where it’s strong. And based on where it’s weak and where it’s strong, you use that to try and decide how to go out and pick other examples specifically so that you can minimize the amount of data that you have to collect and provide annotations for, such that you end up with a model that makes good predictions at the end. So that’s just active learning. And existing techniques for doing active learning, a lot of them revolve around assumptions about what kind of classifier, or what kind of decision function you are going to train on that data that you are collecting the labels for. So there might be assumptions that all of the data already has some sort of fixed representation and then you are going to feed that representation into a linear classifier, for example. 
And if you make that kind of assumption, then there might be very good heuristics for going out and deciding which particular sets of features you want to apply labels to. So you can minimize the uncertainty and minimize the number of errors that’s made by this linear classifier. But for working with more complicated data, or working in scenarios where you also want to learn a powerful representation of the data at the same time that you’re collecting the data and applying labels, you might want to sort of transform this process where you decide on what the model is going to be and then you sit down for weeks, or years, and come up with a very clever heuristic for how to collect data efficiently to make that model succeed when it has a small amount of labeled data. And you’d like to replace some of those more effort-intensive parts of the process with a machine that can kind of train itself to learn what kinds of data it’s going to need, at the same time that you are also training the model that’s making the prediction.
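
As a concrete picture of the hand-designed heuristics mentioned above, here is a minimal sketch of uncertainty sampling, a common active learning rule that asks a human to label the examples the current model is least sure about. It is a stand-in for illustration, not the learned selection strategy Dr. Bachman is aiming for; the function name `pick_queries` and the toy probabilities are hypothetical.

```python
import numpy as np

def pick_queries(class_probs, k):
    """Return indices of the k unlabeled examples with the highest predictive entropy."""
    p = np.clip(class_probs, 1e-12, 1.0)        # avoid log(0)
    entropy = -(p * np.log(p)).sum(axis=1)      # high entropy = the model is unsure
    return np.argsort(-entropy)[:k]             # most uncertain examples first

# Hypothetical model outputs on four unlabeled examples (two classes).
probs = np.array([
    [0.99, 0.01],   # confident
    [0.55, 0.45],   # unsure
    [0.90, 0.10],   # fairly confident
    [0.50, 0.50],   # maximally unsure
])
queries = pick_queries(probs, k=2)
print(queries)  # indices 3 and 1: the two examples the model is weakest on
```

In a full loop, the selected examples would be sent to an annotator, added to the labeled set, and the model retrained before the next round of selection.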

(music plays)

Host: Let’s spend some time talking about your current research, and there’s a lot of flavors to it. Let’s start with what you are calling Deep Infomax or DIM, but I want to point out too, that, in addition to Deep Infomax, you have Augmented Multiscale Deep Infomax, or AMDIM, Spatio-temporal Deep Infomax, Deep Graph Infomax… There’s a lot of sort of offshoots I guess you might call it. So I’m going to go sort of free range here because you’ll be able to give us a better guided tour of the main idea, and all the offshoots, better than I will. Tell us about the Deep Infomax research family and what you’re up to.

Phil Bachman: Okay. So the kind of higher level idea that ties these things together is the idea that we want to learn to represent the data that we’re looking at. So sometimes that data might be text, sometimes it might be images or in the case, for example, of the Deep Graph Infomax, it might be a graph. So the overall higher level idea of Deep Infomax is that we want to form representations that act a little bit like an associative memory. Kind of going back to what I was saying about the thing with the split faces before, we can think of the left half of a face and the right half of a face sort of as random variables. So you can think of just sampling the left half of a face and there might be slightly different versions of the right half of that face that are all sort of valid. So looking at the left half, I guess, as you were getting at with the Harrison Ford thing, the right half isn’t always perfectly determined, but you can think of the distribution of all possible right half faces, and the variability there is much broader than the variability that you have if you are just looking at, what is the right half of Harrison Ford’s face given that we’re looking at the left half? So the mutual information between our representation of the left half of the face and the right half of the face is high. When our ability to predict what the right half of the face looks like is very good, relative to how well we could predict what the right half of the face looks like in the case where we didn’t get to see the left half, if we were just looking at a bunch of different images that had the same shape as the images of the right halves of a face, these images have a lot of variability in their structure. Like some of them, it might be the back half or the front half of a car or something like that and looks very different from faces. 
So, in principle, we can sort of make a reasonable prediction, for example, of whether or not the image that we’re looking at right now encodes the right half of a face, but there’s still some uncertainty there. And then when we add in the left half of Harrison Ford’s face, and we’re trying to say, okay, well out of the distribution of things that look like the right halves of faces, which ones correspond to Harrison Ford, the more precisely we can make that guess, the higher the mutual information is between our representation of the left half and the right half of the face.
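
One standard way to write down the quantity being described, offered as a sketch rather than the exact form used in the Deep InfoMax work: let $L$ and $R$ (symbols introduced here for illustration) be random variables for the left-half and right-half representations, and let $H$ denote entropy.

```latex
I(L; R) = H(R) - H(R \mid L)
```

Here $H(R)$ is the uncertainty about the right half when nothing else is known, and $H(R \mid L)$ is the uncertainty that remains after seeing the left half. The mutual information is large exactly when seeing the left half removes most of that uncertainty, which matches the "how much better can we predict" comparison in the explanation above.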

Host: Well, let me ask you to go a little deeper on the technical side of this. You sent me a slide that has a lot of algorithmic explanation of Deep Infomax and then how you kind of take that further with Augmented Multiscale Deep Infomax…

Phil Bachman: So the actual mutual information aspect, sort of formally, the way it shows up here, is that we sample this kind of true pair of corresponding image and audio sample, and then we have a distribution from which we can sample just another random audio sample and we can sample maybe, say, a thousand of those other random audio samples and we can encode them with our audio encoder. And then we can sort of present a little classification problem…

Host: Mmm-hmm.

Phil Bachman: …to the model that looked at the image, where that classification problem is telling the model that looked at the image to identify which, among, let’s say, one thousand and one audio recordings is the audio recording that comes from that same point in time. So the mutual information here, um, what we’re doing kind of more technically is we’re constructing a lower bound on the mutual information between the random variables corresponding to the representation of the image and the representation of the audio modality. So we first draw a sample from the joint distribution of the representations of those two modalities, and then we also have to sample a lot of samples from what’s called the marginal distribution of that second random variable which is the representations of the audio modality. So we draw, say, a thousand samples from that marginal distribution and we construct this little classification problem where the model is trying to identify which of the audio samples was the sample from the true joint distribution over audio and visual data, and which of the samples just came from random samples from the marginal distribution. So this is a technique called Noise Contrastive Estimation that’s been developed and applied in a lot of different scenarios. So a good example of this is techniques that have been used for training word vectors. But in the case where we’re using it, it’s a technique that can be used for constructing kind of a formally correct lower bound on the mutual information between these two random variables, one of which corresponds to you know, samples of visual data and one of which corresponds to samples of audio data.

Host: Okay.

Phil Bachman: And the joint distribution over those two kind of random variables is constructed by just going around the world with a camera and a microphone and just taking little snippets of visual and audio information from different points in time and in different scenes.
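
The lower bound mentioned above is commonly written in the following form. This is a sketch under assumptions: $x$ is the image representation, $y$ the matching audio representation drawn from the joint distribution, $\tilde{y}_1, \dots, \tilde{y}_N$ are $N$ audio representations drawn from the marginal, and $f$ is a scoring function (for example a dot product between the two encodings). The symbols are introduced here for illustration and are not taken from the slide discussed in the conversation.

```latex
\mathcal{L}_{\mathrm{NCE}}
  = -\,\mathbb{E}\!\left[
      \log \frac{e^{f(x,\,y)}}{e^{f(x,\,y)} + \sum_{j=1}^{N} e^{f(x,\,\tilde{y}_j)}}
    \right],
\qquad
I(X; Y) \;\ge\; \log(N + 1) - \mathcal{L}_{\mathrm{NCE}}
```

The loss is the cross-entropy of the $(N+1)$-way classification problem, so driving it down tightens the bound. With the one thousand distractors in the example, the bound can certify at most $\log(1001) \approx 6.9$ nats of mutual information, which is one reason larger sets of distractors are often used.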

Host: All right. Well, as you described Deep Infomax, and then you have Augmented Multiscale Deep Infomax, you call that improving Deep Infomax based on some limitations in the prior. How would you differentiate how the Augmented Multiscale Deep Infomax is better than the original idea?

Phil Bachman: Yeah, so the original idea, depending on specifically how you implement it, has some significant downsides in some sense. The original Deep Infomax was just looking at a single version of a single image, and in this case, there’s sort of an issue where, if you are just looking at the single image, and, let’s say, encoding all of the little patches in the image, the way that the original Deep Infomax presentation kind of goes is that you take that image, you encode each of the patches and you also encode the whole image. And so here, we’re going to sort of train the representation of the whole image such that it can look at all of these patches and say that oh, those are patches that came from my image.

Host: Mmm.

Phil Bachman: So this is a little bit like that idea of associative memory, but it’s applied on sort of a single input.

Host: Okay.

Phil Bachman: So kind of procedurally how you would do this is that you would take an image, you would encode it, you get representations of all the little patches and you get a representation of the whole image. And now you’re going to construct a little classification problem where you take a thousand other images and you also encode their patches and you sort of mix them into a bucket with all the patches from the original image that you computed a full image encoding for, and the job of the full image encoding is to look in that bucket and essentially pick out all the patches that are part of its image.
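
The bucket-of-patches procedure can be sketched as follows. This is a minimal illustration under assumptions: the whole-image encoding and the patch encodings are plain vectors, pairs are scored with a dot product, and each patch from the image is set against all distractor patches in its own small classification problem. The function name `local_global_loss` and the toy data are hypothetical.

```python
import numpy as np

def local_global_loss(global_code, own_patches, other_patches):
    """Average cross-entropy of picking each own patch out of a bucket of distractors."""
    pos = own_patches @ global_code            # (p,) scores for patches from the same image
    neg = other_patches @ global_code          # (m,) scores for patches from other images
    # One small classification problem per own patch: that patch versus all distractors.
    logits = np.concatenate([pos[:, None], np.tile(neg, (len(pos), 1))], axis=1)  # (p, 1 + m)
    logits = logits - logits.max(axis=1, keepdims=True)    # stabilise the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()             # the true patch sits in column 0

rng = np.random.default_rng(0)
d = 32
center = rng.normal(size=d)
own = center + 0.1 * rng.normal(size=(4, d))   # four patches that share this image's content
others = rng.normal(size=(50, d))              # patches taken from other images

loss_informed = local_global_loss(own.mean(axis=0), own, others)   # encoding summarises its patches
loss_blind = local_global_loss(rng.normal(size=d), own, others)    # encoding ignores the image
print(loss_informed, loss_blind)
```

A whole-image encoding that simply summarises its own patches already solves this task almost perfectly in the toy data.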

Host: Hmm.

Phil Bachman: So one of the difficulties here, like one of the shortcomings of that particular way of formulating it, if you take that more restrictive interpretation, the main downside is that the encoder that’s processing the full image can basically just memorize the content that’s there. And it’s fairly easy for the model to just copy that information into the representation of the whole image, and essentially it’s just memory that stores the representations of all the little patches. There might be some areas in which this is useful, but for some types of predictive tasks, it might not be so useful because you’re not really asking the representation of the whole image to answer sort of interesting predictive problems about what kinds of other things might you see that weren’t explicitly in the image that you’re looking at now.

Host: Right.

Phil Bachman: So if you are looking at left half of faces and right half of faces, if, instead of looking at left half of face and right half of face, all you did was you showed your encoder this left half of the face, you encoded it to a small vector and then you showed it the same half again and you said is this the one that you looked at before? The model might not actually have to learn that much to be able to solve that task really well. But if you take it and you change it to a task where the kinds of predictions that you are forcing that representation to make are a little bit more interesting, you can ask a more interesting question which is like, did this eye come from the right half of the face whose left half you looked at? So here, now, the model is answering kind of a more challenging question.

Host: Right.

Phil Bachman: This is one of the main changes that we make when we go from the original formulation of the Deep Infomax to this Augmented Deep Infomax. So this is the augmented part, not the multiscale part. That’s another thing, where we’re looking at multiple scales of representation. But if we just look at the augmented part, kind of the big improvement there, is that we’re forcing the model to answer questions, or form an associative memory, where the associations that we’re forcing it to make are more challenging to make, so that the model has to put a little more effort into how it’s going to represent the data.

(music plays)

Host: I like to explore consequences, Phil, both intended and otherwise, that new technologies inevitably have on society, and this is the part of the podcast where I always ask, what can possibly go wrong? So you’re working in a lab that has a stated aim of teaching machines to read, think and communicate like humans. Is there anything about that, that keeps you up at night, and if so, what is it, and more importantly, what are you doing to address it?

Phil Bachman: So we do have a group here that’s working on what we call fairness, accountability, transparency and ethics. So it’s the FATE group. So they’re working on a lot of questions that are, let’s say, immediately relevant as opposed to questions that are kind of long-term relevant – or irrelevant depending on your perspective! – umm… so there’s this idea of existential risk, which is more of a long-term question. So this is the kind of question like, well, if we develop a superhuman AI, is it going to care about us and take care of us, or is it going to consume us in its quest for more resources? So we’ll set that aside. And so like the more immediately salient one is the kinds of things that the FATE group is looking at, and so these are things like well, if we’re training a system that’s going to sit at a bank and analyze people’s credit history, are there historical trends in the data that might be due to systemic discrimination or systemic disadvantaging of particular groups of people, that are going to be reflected in the data that we use to train our system such that then, when the system goes to make decisions, it’s kind of implicitly or accidentally discriminating against these groups just due to the fact that they were also historically discriminated against and that’s reflected in the data that we’re using to train the system. So me personally, a great thing that I could do would be create something that’s like the internal combustion engine of machine learning, or even like the steam engine. Those things have had an incredible effect on society and that’s been very empowering and it’s helped with a lot of progress, but it also makes it easier for people to do bad things at scale. So I’m kind of more worried about that type of problem. And I think that that type of problem isn’t necessarily a technological problem. It’s a little bit more of a system or social problem. 
Because I think the technology is going to happen, and so kind of the things that worry me there are along the lines of like seeing the technology and the way in which it increases people’s leverage over the world and the ability to affect it kind of at scale. I guess for me, on a day-to-day basis, like I don’t think about it too much as I’m doing research because to me, again, it’s not really so much of a technical problem. I think it would be very hard to design the technology so that it can’t do bad things.

Host: Well listen, I happen to know you didn’t start out in Montreal. So tell us a little bit about yourself. What got a young Phil Bachman interested in computer science and how did he land at Microsoft Research in Montreal?

Phil Bachman: I kind of always grew up with a computer in the home. I was fortunate in that sense, that I was always around computers and I could use them for playing games and I could do a little bit of programming. And I’m not old, but I’m not in the youngest demographic that you would see, uhhh, working in tech. And one of the things that I really liked when I was in high school, I started playing a lot of these first person games where you kind of run around and you shoot things. You know, for better or worse, it was fun. So one of the things that was a challenge at first for me was, I didn’t have great internet. So what I would do is go to the school library and look around and it turned out that you could download some bots that people had made so you could sort of fake the multi-player kind of experience. So I thought that was really cool. And one of the things I had, you know, started thinking about there was, okay, well, you know, what is it that these bots are actually doing? So I was doing a little bit of coding and like making some little simple games. So thinking about that, like how would we automate this little thing that kind of is fairly simple at its core, but that, when we let it loose in this environment – so like when we let it run around and compete with the other players – it does something interesting and fun? And so that was sort of always at the back of my mind a bit I guess. And I bounced around a little bit, academically, and started doing research in a slightly different field, but then eventually I kind of sat around and watched a bunch of online lectures and there were a couple of areas of machine learning, like reinforcement learning for example, that really started to click with me and that I was excited about because it was getting back to those kinds of questions I’d asked myself about before, like how do we get this little bot to do interesting things. 
So that brought me from Texas… because I was in grad school in Texas after having done my undergraduate studies in New York… But then I found this group that was in Montreal, doing reinforcement learning, so I came and I worked with that group and that’s where I did my PhD. And then afterwards, I hung around and I liked the city pretty well, and I was looking around at kind of the jobs that were available elsewhere, and an exciting opportunity popped up here. So, there was a start-up called Maluuba that was based out of Waterloo, and it was developing kind of technology and software for doing virtual personal assistants, and the company wanted to sort of start getting more aggressive about pushing their technology forward, so they came to Montreal because there was a lot of cool machine learning stuff happening in Montreal, and then opened a research lab and, basically, as those lab doors were opening, I walked in and joined the company. And about a year later, we were actually acquired by Microsoft. So that’s how I ended up at MSR.

Host: Well, at the risk of heading into uncomfortable ice-breaker question territory, Phil, tell us one interesting thing about yourself that people might not know and how has it influenced your career as a researcher? And even if it didn’t, tell us something interesting about yourself anyway!

Phil Bachman: Personally, I’d say, one thing that I’ve always enjoyed is being fairly involved in at least one type of, let’s say, goal-oriented physical activity. That’s a super weird sounding description. But for example, as an undergrad, I did a lot of rock climbing. So having that as a thing where I could just really be focused and apply myself to solving problems in some sense – a lot of climbing is about kind of planning out what you are going to do and it’s a little bit like solving a puzzle sometimes – and having that as a thing that’s sort of separate from the work I do, but that still is kind of mentally and also physically active, and being able to kind of apply myself to that strongly. I don’t do the rock climbing specifically anymore, but what I do now is I play a lot of soccer. So I really enjoy the combination of the physical aspect as well as the mental aspect of the game, so there’s a lot of extemporaneous kind of inventive thinking. And it can be very satisfying when you kind of do something that’s exactly right at exactly the right time, especially when you realize later that you didn’t really even think about it, it just sort of happened. And I guess that might be related to some of the better moments, as a researcher, that you have when you are trying to solve a problem and you’re just kind of messing around and then something just sort of clicks and you just kind of see how you should do it.

Host: At the end of every podcast, I give my guests the proverbial last word. So tell our listeners from your perspective, what are the big challenges out there right now, and research directions that might address them, when we’re talking about machine learning research and what’s hype and what’s hope and what’s the future?

Phil Bachman: I guess one that I would say is filtering through all the different things that people are writing and saying, and trying to figure out which parts of what they are saying seem new but they are really just kind of a rewording of some concept that you’re familiar with and you just kind of have to rephrase it a little bit and then see how it fits into your existing internal framework. And being able to use that ability to figure out what’s new and what’s different and figure out how it differs from what people were trying before, and that allows you to be kind of more precise in your guesses about what is actually important. But a lot of that sort of washes out in the end and it doesn’t really survive that long. Sort of at the beginning, as a researcher, you have to, you know, rely on other people because you don’t really know where you are going yet, but over time, taking those training wheels off a little bit and developing your own personal internal framework for how you think about problems, so that when you get new information, you can kind of quickly contextualize it and figure out which are the new bits that are actually going to change the way that you look at things, and which bits are sort of just a different version of something that you already have.