Competing in the X Games of machine learning with Dr. Manik Varma

Published February 13, 2019

Manik Varma, Principal Researcher at Microsoft Research India

Episode 63, February 13, 2019

If every question in life could be answered by choosing from just a few options, machine learning would be pretty simple, and life for machine learning researchers would be pretty sweet. Unfortunately, in both life and machine learning, things are a bit more complicated. That’s why Dr. Manik Varma, Principal Researcher at MSR India, is developing extreme classification systems to answer multiple-choice questions that have millions of possible options and help people find what they are looking for online more quickly, more accurately and less expensively.

On today’s podcast, Dr. Varma tells us all about extreme classification (including where in the world you might actually run into 10 or 100 million options), reveals how his Parabel and Slice algorithms are making high quality recommendations in milliseconds, and proves, with both his life and his work, that being blind need not be a barrier to extreme accomplishment.

Final Transcript

Manik Varma: In 2013, I thought there is no way we can learn 10 million or 100 million classifiers. And even if we could learn them, where would we store them? And even if we could store them, how would we make a prediction in a millisecond? And so, I just turned away from one-versus-all approaches and we tried developing trees and embeddings. But today, we’ve actually managed to overcome all of those limitations. And the key trick is to go from linear time training and predictions to log-time training and prediction.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: If every question in life could be answered by choosing from just a few options, machine learning would be pretty simple, and life for machine learning researchers would be pretty sweet. Unfortunately, in both life and machine learning, things are a bit more complicated. That’s why Dr. Manik Varma, Principal Researcher at MSR India, is developing extreme classification systems to answer multiple-choice questions that have millions of possible options and help people find what they are looking for online more quickly, more accurately and less expensively.

On today’s podcast, Dr. Varma tells us all about extreme classification (including where in the world you might actually run into 10 or 100 million options), reveals how his Parabel and Slice algorithms are making high quality recommendations in milliseconds, and proves, with both his life and his work, that being blind need not be a barrier to extreme accomplishment. That and much more on this episode of the Microsoft Research Podcast.

Host: Manik Varma, welcome to the podcast.

Manik Varma: Thanks, Gretchen. The pleasure’s entirely mine.

Host: So, you’re a principal researcher at MSR India and an Adjunct Professor of Computer Science at IIT Delhi. In addition to your work in computer science, you’re a physicist, a theoretician, an engineer and a mathematician. You were a Rhodes Scholar at Oxford and a University Scholar when you did your doctoral work there. And you were a Post Doc at MSRI Berkeley. And you’re blind. I’m not going to call you Superman, but I would really love to know what has inspired you to get to this level of accomplishment in your life. What gets you up in the morning?

Manik Varma: I guess it’s a combination of my hopes and desires on the one hand, and I guess fears as well on the other. So, I guess hopes and desires – I hope every day I get up I learn something new. That’s one of the best feelings I’ve had, and that’s what’s driven me all this way. And actually, that’s why I’m in research, because I get to ask new questions and learn about new things throughout my career. So that’s fantastic. And the other hope and desire that I have that drives me is to build things and build things that will help millions of people around the world. And I guess there’s some fears lurking behind that as well. I’ve been worried about the “Imposter Syndrome” all my life, and, uh, yeah… So, I guess the best way to tackle that is actually to try and do things and get things out there and have people use them and be happy with them. So, I guess that’s the fear that’s driving the hopes and desires.

Host: So Manik, all of this has been without the use of your eyes. How have you gone about all of this?

Manik Varma: Right, so I have a condition where the cells in my retina are deteriorating over time. So, in about 2009, I think, I started using a blind stick, and then I lost the ability to read papers, then recognize faces, stuff like that. But it makes life interesting right? I go up to my wife and ask, who are you? That’s the secret to a happily married life!

Host: Well, so, you haven’t been blind since birth?

Manik Varma: No.

Host: Oh, okay.

Manik Varma: Well, it’s a hereditary condition, but nobody else in my family has it, so it started deteriorating from the time I was in school, and I lost the ability to do some things in school. But, it’s only over the last, let’s say, 10 years where it’s become really significant. But again, it’s been no big deal also, right? It’s only in the last decade where I’ve had to think about it, and that’s, I think, because of my kids, right? I want to set them an example that they can… because this is hereditary, there is a probability they might have it, and if they do, I don’t want them to feel that they can’t do something, right? So, if they want, they can go climb Mount Everest or become the best historian or the best scientist or whatever. As long as they set their minds up, they can do it.

Host: We’re going to talk about this relatively new area of machine learning research called extreme classification in a minute. But let’s start a bit upstream of your current work first. Give us a quick review of what we might call the machine learning research universe, as it were, focusing on a bit of the history and the types of classification that you work with, and this is just to give the framework for how extreme classification is different from other methodologies and why we need it.

Manik Varma: So, if you look at the history of machine learning, then the most well-studied problem is binary classification where we learn to answer yes/no questions involving uncertainty. And then the community realized that, actually, there are many high-impact applications out there in the real world that are not just simple yes/no questions, right? They’re actually multiple-choice questions. And this leads to the field of multi-class classification. And then, after that, the community realized that there’s some high-impact applications that are not only multiple-choice, but they also have multiple correct answers. And this led to the establishment of the area of multi-label classification. So just to take examples of all of these, if you ask the question, is Gretchen speaking right now or not, then that’s binary classification. Whereas if you turn this into a multiple-choice question such as who’s speaking right now? Is it Gretchen or Manik or Satya Nadella? So that’s a multiple-choice question, that’ll be multi-class classification. But now suppose you threw a cocktail party and you invited all the top machine learning AI scientists over there. And then you took a short clip and asked, who’s speaking in this clip? That becomes a multi-label classification problem, because now multiple people could be speaking at the same time. And so, if you have L choices in front of you, then in multi-class, the output space is L dimension, or order L. But if you have a multi-label, then the output space is two to the power L. Because every person may or may not be speaking. So, you go from two choices in binary classification to tens to hundreds and thousands of choices in multi-class and multi-label learning. And if you looked at the state of the art in about 2012, the largest multi-label data set out here had about 5,000 labels. 
And I remember all my colleagues like running their algorithms for weeks on big clusters to try and solve this problem, because two to the power 5,000 is way more than the number of atoms in the universe. So, it’s a hard problem.
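
The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and the figure of roughly 10^80 atoms in the observable universe is a commonly quoted rough estimate, not something stated in the conversation.

```python
import math


def multiclass_outputs(num_labels: int) -> int:
    """Multi-class: exactly one of L labels is correct, so there are L outputs."""
    return num_labels


def multilabel_outputs(num_labels: int) -> int:
    """Multi-label: every label may or may not apply, so there are 2**L outputs."""
    return 2 ** num_labels


L = 5000  # size of the largest multi-label data set mentioned for 2012

# Count the decimal digits of 2**L with logarithms instead of printing the number.
digits = math.floor(L * math.log10(2)) + 1

# Assumption: about 10**80 atoms in the observable universe (rough estimate).
ATOMS_EXPONENT = 80

print("multi-class outputs:", multiclass_outputs(L))
print("decimal digits in 2**L:", digits)
print("exceeds 10**80:", digits - 1 > ATOMS_EXPONENT)
```

Python integers have arbitrary precision, so `2 ** 5000` is computed exactly; the logarithm is used only to report its size compactly.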

Host: So, this has been a problem for quite some time. It’s not brand new, right? But it’s getting bigger? Is this our issue?

Manik Varma: Right. And actually, that’s how extreme classification got started. So, as I mentioned, in 2012, the largest publicly available multi-label data set had about 5,000 labels. But then in 2013, we published a paper which exploded the number of labels being considered in a multi-label classifier from 5,000 to 10 million.

Host: Wow.

Manik Varma: And that really changed the nature of the game. So, the motivating application was to build a classifier that could be used as a tool by advertisers that would predict which Bing queries would lead to a click on the ad or the document. And, you can well imagine from the context of the application that this is a really important problem, from both the research as well as a commercial perspective. And so many sophisticated natural language processing, machine learning, information retrieval techniques have been developed in the literature to solve this problem. But unfortunately, none of these were working for our ads team. They had billions of ads for which all these sophisticated approaches were not making good quality predictions. And so, we decided to go back to the drawing board. We set aside all of these approaches and simply reformulated the problem as a multi-label classification problem where we would take the document as input, and then we would treat each of the top queries on Bing as a label, so you took the top 10 million monetizable queries on Bing, and now you just learn the classifier that will predict, for this particular document or ad, which subset of top 10 million Bing queries will lead to a click on the ad.

Host: Top 10 million?

Manik Varma: Yeah, so from 5,000 to 10 million. This was just a very different and completely new way of looking at the problem. And it took us two years to build the system, run the experiments, publish our results and check everything out. But once the results came in, we found that our approach was much better than all these traditional approaches. So, the number of ads for which you are making good quality recommendations went up from about 60% for the Bing system to about 95-98% for us.

Host: Wow.

Manik Varma: And the quality of our recommendations also improved a lot. And so that led to the establishment of the area of extreme classification which deals with multi-class and multi-label problems in extremely large label spaces. In millions or billions of labels. And I think that’s exactly why extreme classification grew to be a whole new research area in itself. And that’s because, I think, fundamentally new research questions arise when you go from, let’s say, 100 labels to 100 million labels. Let me just give you a couple of examples if you’ll permit me the time.

Host: Yes, please.

Manik Varma: The whole premise in supervised machine learning is that there’s an expert sitting out there who we can give our data to, and he or she will label the data with ground truth: what’s the right answer, right? What’s the right prediction to make for this? But unfortunately, in extreme classification, there is no human being who can go through a list of 100 million labels to tell you, what are the right predictions to make for this data point. So even the most fundamental machine learning techniques such as cross-validation might go for a toss at the extreme scale. And you’ll have missing labels in your test set, in your validation set, in your training set. And this is like a fundamental difference that you have with traditional classification where a human being could go through a list of 100 labels and mark out the right subset. Another really interesting question is the whole notion of what constitutes a good prediction changes when you go from 100 labels to, let’s say, 100 million. When you have 100 labels, you need to go through that list of 100 and say, okay, which labels are relevant? What labels are irrelevant? But when you have 100 million, nobody has the time or patience to go through it. So, you need to give your top five best predictions very quickly and you need to have them ranked with your best prediction at the very top and then the worst one at the bottom. And you need to make sure that you handle this “missing labels” problem, because some of the answers that you predict might not have been marked by the expert. So, all of this changes as you go from one scale to the next scale.

(music plays)

Host: Let’s talk for a second about how extreme classification can be applied in other areas besides advertising. Tell us a little bit about the different applications in this field and where you think extreme classification is going.

Manik Varma: I think one of the most interesting questions that came out of our research was, when or where in the world will you ever have 10 million or 100 million labels to choose from? If you just think about it for a minute, 100 million is a really large number. Just to put it in context, to see how big a number this is, when I was doing my PhD in computer vision, the luminaries in the field would get up and tell the community that, according to Biederman’s counts, there are only 60,000 object categories in the world. So, none of the classical visual problems will make the cut. And even if you were to pick up a fat Oxford English Dictionary, it would have somewhere around half a million words in it. So many traditional NLP problems might not also make the cut. Then over the last five years, people have actually found very high impact applications of extreme classification. And so, for example, one of them leads to reformulations of well-known problems in machine learning like ranking and recommendation, which are critical for our industry. So, suppose you wanted to, for instance, design a search engine, right? You can treat each document on the web as a label, and now when a query comes in, you can learn the classifier that will take the query’s feature vector’s input and predict which subset of documents on the web are relevant to this particular query. And so, then you can show those documents and you can rank them on the strength of the classifier’s probabilities and you can reformulate ranking as a classification problem. And similarly, think about like recommendation, right? So, suppose you were to go onto a retailer’s website. They have product catalogs that run into the millions, if not hundreds of millions. And so no human being can go through the entire catalog to find what they’re looking for. And therefore, recommendation becomes critical for helping users find things they’re looking for. 
And now you can treat each of the hundred million products that the retailer is selling as a particular category, and you learn a classifier that takes the user’s feature vector as input and simply predicts which subset of categories are going to be of interest to the user and you recommend the items corresponding to those categories to the user. And so you can reformulate ranking and recommendation as extreme classification, and sometimes this can lead to very large performance gains as compared to traditional methods such as collaborative filtering or learning to rank or content-based methods. And so that’s what extreme classification is really good for.
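
The reformulation described above, where every product is a label and recommendation means returning the highest-scoring labels, might be sketched as follows. This is a toy illustration, not code from the work discussed: the product names, weights and the `top_k_labels` helper are all invented, and each label gets a plain linear scorer.

```python
import heapq


def top_k_labels(features, label_weights, k=5):
    """Score every label with its own linear classifier and return the best k.

    features:      list of floats describing the user (or query)
    label_weights: dict mapping label -> weight vector of the same length
    """
    scores = {
        label: sum(w * x for w, x in zip(weights, features))
        for label, weights in label_weights.items()
    }
    # Ranked list: the strongest prediction first.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical catalog of four products, each treated as one label.
label_weights = {
    "hiking boots": [0.5, 0.1, 1.0],
    "tent": [0.2, 0.9, 0.8],
    "blender": [-0.3, 1.2, 0.0],
    "trail map": [0.1, 0.0, 0.9],
}
user = [1.0, 0.0, 2.0]  # hypothetical user feature vector

recommendations = [label for label, _ in top_k_labels(user, label_weights, k=2)]
print(recommendations)
```

Scoring every label like this costs time linear in the number of labels, which is workable for four products and hopeless for 100 million; that gap is what the log-time methods discussed later in the conversation address.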

Host: Let’s talk about results for a minute. How do we measure the quality of any machine learning classification system? I imagine there are some standard benchmarks. But if it’s like any other discipline, there are tradeoffs, and you can’t have everything. So, tell us how should we think about measurement? What kinds of measurements are you looking at, and how does extreme classification help there?

Manik Varma: So, the axes along which we measure the quality of a solution are training time, prediction time, model size and accuracy. So, let’s look at these one by one. And they’re all linked also. So, if you look at training time, this is absolutely critical for us. We cannot afford to let our extreme classifiers take months or years to train. So, think of the number of markets in which Bing operates. All over the world. If it were to take you, let’s say, two days to train an extreme classifier, then every time you wanted to deploy in a new market, that would be two days gone. Every time you wanted to run an experiment where you change a feature, two days gone. Every time you want to tune hyperparameters, two days gone. And again, when I’m saying two days, this is probably on a cluster of a thousand cores, right? So how many people or how many groups have access to those kinds of clusters? Whereas if you could bring your training time down to, let’s say, 17 minutes of a single core of a standard desktop, now you can run millions of experiments and anyone can run these experiments on their personal machines. And so the speed with which you can run experiments and improve the quality of your product and your algorithm, there’s a sea change in that.

Host: Yeah.

Manik Varma: So, training time becomes very important. But as you pointed out, right, you want to maintain accuracy. So, you can’t say, oh, I’ll cut down my training time, but then I’ll make really bad predictions. That’s not acceptable. And then the second thing is model size. So, if your classifier is going to take, let’s say, a terabyte to build its model, then your cost of making recommendations will go up. You need to buy more expensive hardware. And, again, the number of machines that have a terabyte is limited, so the number of applications you can deal with in one go is limited. So again, you want to bring your model size down to a few gigabytes so that it can fit on standard hardware and anybody can use this for making predictions. And again, there’s a tradeoff between model size and prediction time, right? You can always trade speed for memory, but now the question is, how can you bring down your model size to a few gigabytes without hitting your prediction time too much? And the reason prediction time is important is because, again, at the scale of Bing, we might have 100 billion documents that we need to process regularly. So, if your prediction time is more than a few milliseconds per document, then there is no way you can make a hundred billion predictions in a reasonable amount of time. So, your solution simply will not fly. And all of this, as I said, is tied to accuracy. Because people will not suffer poor recommendations.
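
A back-of-the-envelope calculation shows why a few milliseconds per document matters at this scale. Only the figure of 100 billion documents comes from the conversation; the latencies and core counts below are hypothetical, and perfect parallelism is assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_process(num_documents, ms_per_document, num_cores=1):
    """Wall-clock days to score every document once, assuming perfect parallelism."""
    total_seconds = num_documents * (ms_per_document / 1000.0)
    return total_seconds / num_cores / SECONDS_PER_DAY


NUM_DOCUMENTS = 100_000_000_000  # "100 billion documents"

# Hypothetical latencies (ms per document) and hypothetical core counts.
for ms in (2, 100):
    for cores in (1, 1000):
        days = days_to_process(NUM_DOCUMENTS, ms, cores)
        print(f"{ms:>3} ms/doc on {cores:>4} cores: {days:,.1f} days")
```

Under these assumptions, 2 ms per document on 1,000 cores finishes in a little over two days, while 100 ms per document on the same hardware takes months.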

Host: No. I won’t.

Manik Varma: Yeah.

Host: Well so, circling back, how does extreme classification impact your training time, model size, speed, accuracy, all of that?

Manik Varma: If you look at the papers that we are publishing now, like Parabel and Slice, we can train on that same data set in, as I said, 17 minutes of a single core of a standard machine. And so, we’ve really managed to cut down training time let’s say by 10,000 times over the span of five years. We’ve managed to reduce our model size from a terabyte to a few gigabytes. Our prediction time is a few milliseconds per test point, and we managed to increase accuracy from about 19% precision at one, to about 65% today. So, precision at one is if you just look at the top prediction that was made, was that correct or not? So, on a standard benchmark data, say it was 19% in 2013, and we managed to increase that to 65% today. So, there’s been a remarkable improvement in almost all metrics over the last 5-6 years.
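
Precision at one, and its generalization precision at k, can be written down in a few lines. The sketch below uses invented query IDs and helper names; it shows the metric as described in the conversation, not any particular benchmark's evaluation script.

```python
def precision_at_k(ranked_predictions, relevant_labels, k):
    """Fraction of the top-k predicted labels that are marked relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_predictions[:k]
    hits = sum(1 for label in top_k if label in relevant_labels)
    return hits / k


def mean_precision_at_k(all_predictions, all_relevant, k):
    """Average precision@k over a set of test points."""
    scores = [precision_at_k(p, r, k) for p, r in zip(all_predictions, all_relevant)]
    return sum(scores) / len(scores)


# Two hypothetical test points with ranked predictions, best first.
predictions = [["q3", "q7", "q1"], ["q2", "q5", "q8"]]
relevant = [{"q3", "q1"}, {"q5"}]

p_at_1 = mean_precision_at_k(predictions, relevant, k=1)
p_at_3 = mean_precision_at_k(predictions, relevant, k=3)
print(p_at_1, p_at_3)
```

Because of the missing-labels problem raised earlier, a prediction that is actually correct but was never marked by an annotator still counts as a miss here, so a score computed this way can understate the true precision.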

Host: You just mentioned Parabel and Slice. So, let’s talk about those right now. First of all, Parabel. It’s a hierarchical algorithm for extreme classification. Tell us about it and how it’s advanced both the literature of machine learning research and the field in general.

Manik Varma: So, for the last five years, we’ve been trying out lots of different approaches to extreme classification. We tried out trees. We tried out embeddings in deep learning. And we looked at one-versus-all approaches. And in a one-versus-all approach, you have to learn a classifier per label. And then you use all of them at prediction time to see which ones have the highest scores, and then that’s the classes or the labels you recommend to the user. And in 2013, I thought there is no way we can learn 10 million or 100 million classifiers. And even if we could learn them, where would we store them? And even if we could store them, how would we make a prediction in a millisecond? And so, I just turned away from one-versus-all approaches and we tried developing trees and embeddings. But today, we’ve actually managed to overcome all of those limitations. And the key trick is to go from linear time training and predictions, to log-time training and prediction. And originally, I thought this was not possible, right? Because you have to learn one classifier per label. So, if you have 100 million labels, you have to have 100 million classifiers, so your training time has to be linear in the number of labels. But, with Parabel we managed to come up with a log-time training and a log-time prediction algorithm. And the key is to learn a hierarchy over your labels. So, each node in the hierarchy inherits only about half of its parent’s labels. So, there’s an exponential reduction as you go down the hierarchy. And that lets you make predictions in logarithmic time. Now the trick in Parabel was, how do you make sure that the training is also logarithmic time? Because at the leaves of the hierarchy, you will have one, let’s say, label in the leaf node. And so, you have to have at least as many leaves as classifiers, so that gives you back the linear costs. But somehow Parabel – you’ll need to read the paper in order to get the trick. 
It’s a really cute trick and… but you can get away with log-time training. And so that’s why Parabel manages to get this 10,000-times speed up in prediction, in training, in reduction in model size, and accuracy is great. It’s a really cool algorithm.
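
The idea of a label hierarchy in which each node keeps about half of its parent's labels can be illustrated with a toy tree and a beam search. This is a sketch under strong assumptions and is not Parabel: the tree is built by simply halving a label list, each node is scored with the mean of the label vectors beneath it, and the paper's trick for log-time training is not reproduced. All names here are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class Node:
    """Tree node holding a subset of labels; children split that subset in half."""

    def __init__(self, labels, vectors):
        self.labels = labels
        dim = len(vectors[labels[0]])
        # Toy node scorer: the mean of the label vectors below this node.
        self.center = [
            sum(vectors[label][d] for label in labels) / len(labels)
            for d in range(dim)
        ]
        self.children = []
        if len(labels) > 1:
            mid = len(labels) // 2
            self.children = [Node(labels[:mid], vectors), Node(labels[mid:], vectors)]


def predict(root, features, beam_width=2, top_k=2):
    """Beam search from the root; returns (ranked labels, number of nodes scored)."""
    frontier, leaves, nodes_scored = [root], [], 0
    while frontier:
        candidates = []
        for node in frontier:
            for child in node.children:
                candidates.append((dot(features, child.center), child))
                nodes_scored += 1
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        frontier = []
        for score, child in candidates[:beam_width]:
            if child.children:
                frontier.append(child)  # keep descending
            else:
                leaves.append((score, child.labels[0]))  # reached a single label
    leaves.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in leaves[:top_k]], nodes_scored


NUM_LABELS = 64
# Hypothetical label vectors: label i is the i-th one-hot vector.
vectors = {i: [1.0 if d == i else 0.0 for d in range(NUM_LABELS)] for i in range(NUM_LABELS)}
root = Node(list(range(NUM_LABELS)), vectors)

labels, nodes_scored = predict(root, vectors[37])
print(labels, nodes_scored, "nodes scored versus", NUM_LABELS, "labels")
```

With a fixed beam width, the number of nodes scored grows with the depth of the tree, that is, with the logarithm of the number of labels, rather than with the number of labels itself.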

Host: More recently, you’ve published a paper on what you’ve called the Slice algorithm. What’s the key innovation in Slice? How is it different from – or how does it build on – Parabel?

Manik Varma: Right. So, Slice is also a one-versus-all algorithm where you learn a classifier per label. It also has log-time training and prediction, but it’s based on a completely different principle. So, in Parabel, our approach was to learn a hierarchy. So, you keep splitting the label space in half as you go down each node. Slice has a very different approach to the problem. It says, okay, let me look at my entire feature space. And now if I look at a very small region in the feature space, then only a small number of labels will be active in this region. So now, when a new user comes in, if I can find out what region of feature space he belongs in or she belongs in very quickly, then I can just apply the classifiers in that region. And that’s the key approach that Slice takes to cutting down training time and prediction time from linear to logarithmic. And it’s about to appear in WSDM this month. And it scales to 100 million labels. I mean, if you look at ImageNet, right? So that was 1,000 classes with 1,000 training points per class. And now we have 100 million classes. So, we used it for the problem of recommending related searches on Bing. So, when you go and type in a query on a search engine, the search engine will often recommend other queries that you could have asked that might have been more helpful to you or that will highlight other facets of the topic. And so, for obvious queries, the recommendations are pretty standard and everyone can do a good job. The real fight is in the tail for these queries that are rare that we don’t know how to answer well. Can you still make good quality recommendations? And there, getting even a 1% lift in the success rate is a big deal. Like it takes dedicated time and effort to do that. And we managed to get a 12% lift in performance with Slice in the success rate. So that’s like really satisfying.
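
The principle described here, find the region of feature space quickly and then apply only the classifiers that are active there, can be sketched with a deliberately crude stand-in for the region finder. Everything below is an assumption made for illustration: a fixed grid plays the role of the region lookup, the queries and weights are invented, and none of it reflects how Slice itself locates regions or trains its classifiers.

```python
from collections import defaultdict


def region_of(features, cell_size=1.0):
    """Hypothetical region finder: the grid cell containing the point."""
    return tuple(int(x // cell_size) for x in features)


def index_active_labels(training_points, cell_size=1.0):
    """Record which labels were ever seen in each region of feature space."""
    active = defaultdict(set)
    for features, labels in training_points:
        active[region_of(features, cell_size)].update(labels)
    return active


def predict(features, active, classifiers, top_k=3, cell_size=1.0):
    """Score only the classifiers active in the point's region."""
    shortlist = active.get(region_of(features, cell_size), set())
    scored = [
        (sum(w * x for w, x in zip(classifiers[label], features)), label)
        for label in shortlist
    ]
    scored.sort(reverse=True)
    return [label for _, label in scored[:top_k]], len(shortlist)


# Hypothetical training data: (feature vector, set of related queries).
training_points = [
    ([0.2, 0.7], {"cheap flights", "flight deals"}),
    ([0.6, 0.1], {"cheap flights"}),
    ([3.4, 2.2], {"pasta recipes"}),
    ([3.9, 2.8], {"pasta recipes", "easy dinners"}),
]
# Hypothetical one-versus-all linear classifiers, one weight vector per label.
classifiers = {
    "cheap flights": [1.0, 0.5],
    "flight deals": [0.2, 1.0],
    "pasta recipes": [0.5, 0.5],
    "easy dinners": [0.1, 0.9],
}

active = index_active_labels(training_points)
suggestions, evaluated = predict([0.5, 0.4], active, classifiers)
print(suggestions, "using", evaluated, "of", len(classifiers), "classifiers")
```

A fixed grid does not scale to high-dimensional features and gives no logarithmic guarantee; it is used here only to make visible the saving from evaluating a shortlist instead of every classifier.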

Host: Yeah, and you had some other percentages that were pretty phenomenal too in other areas. Can you talk about those a little?

Manik Varma: Right. So, we also saw increases across the board in trigger coverage, suggestion density and so on. So, trigger coverage is the number of queries for which you could make recommendations. And we saw a 52% increase in that.

Host: 52?

Manik Varma: And… yeah, that’s right. That was amazing.

Host: Statistically significant, on steroids.

Manik Varma: Right. And then the suggestion density is the number of recommendations you make per query. And there was a 33% increase in that as well. So yeah, we had pretty significant lift, and I’m very glad to say like Slice is making billions of recommendations so far. And people are really happy. It’s really improved the success rate of people asking queries on Bing so far.

Host: That’s cool. Speaking of where people can find stuff… I imagine a number of listeners would be interested to plunge in here. Where can they get this? Uh, is it available? Where are the resources?

Manik Varma: So, we maintain an extreme classification repository, which makes it really easy for researchers and practitioners who are new to the area to come in and get started. If you go to Bing and search for the extreme classification repository, or your favorite search engine, you can find it very easily. And there you’ll find not just our code, but you’ll find data sets. You’ll find metrics on how to evaluate the performance of your own algorithm. You’ll find benchmark results showing you what everybody else has achieved in the field so far. You’ll find publications and, if you look at my home page, you’ll also find a lot of talks so you can go and look at the recordings to explore more about whatever area fascinates you. And all of this is publicly available, freely available to the academic community. So, people can come in and explore whatever area of extreme classification they like.

(music plays)

Host: So, Manik, we’ve talked about the large end of extreme classification, but there’s another extreme that lies at the small end of the spectrum, and it deals with really, really, really tiny devices. Tell us a bit about the work you’ve done with what you call Resource Efficient ML.

Manik Varma: Yeah, that’s the only other project I’m working on. And that’s super cool too, right? Because for the first time in the world, we managed to put machine learning algorithms on a microcontroller that’s smaller than a grain of rice. Think of the possibilities that opens up, right? So, you can now implant these things in the brains of people who might be suffering from seizures to predict the onset of a seizure. You could put them all over a farm to try and do precision agriculture, tell you where to water, where to put fertilizer and where to put pesticide and all of that. The applications are completely endless especially once you start thinking about the internet of things. A number of applications in the medical domain, in smart cities, smart housing. So, in 2017, we put two classical machine learning algorithms based on trees and nearest neighbors called Bonsai and ProtoNN onto this microcontroller. It has only 2 kilobytes of RAM, 32 kilobytes of read-only flash memory, no hardware support for floating point operations, and yet we managed to get our classifiers to run on them. And then last year we released two recurrent neural network algorithms called FastGRNN and EMI-RNN. And again, all of these are publicly available from GitHub. So, if you go to GitHub.com/Microsoft/EdgeML you can download all these algorithms and play with them and use them for your own applications.
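
On a device with no floating point hardware, one common workaround is fixed-point arithmetic, where real numbers are stored as scaled integers. The sketch below shows a linear classifier evaluated entirely with integer operations. It is a generic illustration under assumed parameters (8 fractional bits, invented weights) and is not taken from Bonsai, ProtoNN or the released code.

```python
SCALE_BITS = 8  # hypothetical choice: values are stored as integer / 2**8


def to_fixed(x):
    """Convert a real number to its fixed-point integer representation."""
    return int(round(x * (1 << SCALE_BITS)))


def fixed_dot(weights_q, features_q):
    """Integer-only dot product of two fixed-point vectors."""
    acc = 0
    for w, x in zip(weights_q, features_q):
        acc += w * x  # each product carries 2 * SCALE_BITS fractional bits
    return acc >> SCALE_BITS  # shift back to SCALE_BITS fractional bits


def classify(weights_q, bias_q, features_q):
    """Binary decision using only integer add, multiply, shift and compare."""
    return 1 if fixed_dot(weights_q, features_q) + bias_q > 0 else 0


# Hypothetical model and input, quantized once (for example, before deployment).
weights_q = [to_fixed(w) for w in (0.5, -0.25, 1.0)]
bias_q = to_fixed(-0.25)
features_q = [to_fixed(x) for x in (1.0, 2.0, 0.5)]

score_q = fixed_dot(weights_q, features_q)
print(score_q, score_q / (1 << SCALE_BITS), classify(weights_q, bias_q, features_q))
```

Python integers never overflow, so this sketch hides a real concern: on a microcontroller the accumulator has a fixed width, and the scale must be chosen so that sums of products stay within it.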

Host: So, while we’re on the topic of your other work, you said that was the only other project you were working on, but it isn’t. And maybe – maybe they’re tied together, but I’ve also heard you’re involved in some rather extra-terrestrial research. Can you tell us about the work you’re doing with SETI?

Manik Varma: Yeah, so they’re related. But apparently, some of these algorithms that we’ve been developing could have many applications in astronomy and astrophysics. So, if you look at our telescopes right now, they’re receiving data at such a high rate that it’s not possible to process all of this data or even store all of this data. So, because the algorithms we’ve developed are so efficient, if we could put them on the telescope itself, it could help us analyze all types of signals that we are receiving, including, perhaps, our search for extraterrestrial intelligence. So, that’s a really cool project we run out of Berkeley. But there are also lots of other applications, because, if you’re trying to put something on a satellite, I’m told by my astronomer friends that the amount of energy that an algorithm can consume is very limited because energy is at a premium out there in space. And so, things that are incredibly energy efficient or will have very low response time are very interesting to astronomers.

Host: So Manik, I ask all my guests some form of the question, is there anything that keeps you up at night? You’re looking at some really interesting problems. Big ones and small ones. Is there anything you see that could be of concern, and what are you doing about it?

Manik Varma: Extreme classification touches people, right? So, people use it to find things they’re looking for. And so, they reveal a lot of personal information. So, we have to be extremely careful that we behave in a trustworthy fashion, where we make sure that everybody’s data is private to them. And this is not just at the individual level but also the corporate level, right? So, if you’re a company that’s looking to come to Microsoft and leverage extreme classification technology, then again, your transaction history and your logs and stuff, we make sure that those are private, and you can trust us, and we won’t share that with anybody else. And because we’re making recommendations, there are all these issues about fairness, about transparency, about explainability. And these are really important research challenges, and ones that we are thinking of at the extreme scale.

Host: At the beginning of the podcast, we talked briefly about your CV, and it’s phenomenally impressive. But tell us a little bit more about yourself. How did your interest in physics and theory and engineering and math shape what you’re doing at Microsoft Research?

Manik Varma: Yeah, so because I’ve been exposed to all of these different areas, it helps me really appreciate all the different facets of the problem that we’re working on. So, the theoretical underpinnings are extremely important. And then I’ve come to realize how important the mathematical and statistical modeling of the problem is. And then once you’ve built the model, then engineering a really good-quality solution: how to do that, what kind of approximations to make, so that you start from theoretically well-founded principles, but then you make good engineering approximations that will help you deliver a world-class solution. And so, it helps me look at all of these aspects of the problem and try and tackle them holistically.

Host: So, what about the physics part?

Manik Varma: Um, actually to tell you the truth, I’m a miserable physicist. I completely failed. (laughing) Yeah. I’m not very good at physics, unfortunately, which is why I switched. So…

Host: You know, I’ve got to be honest. I’ll bet your bad at physics is way better than my good at physics. So, let’s put it in context, right? All right, well at the end of every show, I ask my guests to offer some parting advice or inspiration to listeners, many of whom are aspiring researchers in the field of AI and machine learning. What big interesting problems await researchers just getting into the field right now?

Manik Varma: Yeah, we’ve been working on it for several years, but we’ve barely scratched the surface. I mean, there’s so many exciting and high-impact problems that are still open and need good quality solutions, right? So, if you’re interested in theory, even figuring out what are the theoretical underpinnings of extreme classification, how do we think about generalization error, and how do we think about the complexity of a problem? That would be such a fundamental contribution to make. If you’re interested in engineering or in the machine learning side of things, then how do you come up with algorithms that bring down your training and prediction costs and model size from linear to logarithmic? So, can we have an algorithm that is log-time training, log-time prediction, log model size, and yet it is as accurate as a one-versus-all classifier that has linear costs? And if you’re interested in deep learning, then how can you do deep learning at the extreme scale? How do you learn at the scale where you have 100 million classes to deal with? How do you personalize extreme classification? If you wanted to build something that would be personalized for each user, how would you do that? And if you’re interested in applications, then how do you take extreme classification and apply it to all the really cool problems that there are in search, advertising, recommendation, retail, computer vision…? All of these problems are open. And we’re looking for really talented people to come and join us and work with us on this team. And location is no bar. No matter where you are in the world, we’re happy to have you. Level is no bar. As long as you’re passionate about having impact on the real world and reaching millions of users, we’d love to have you with us.
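
To make the linear-versus-logarithmic contrast above concrete, here is a minimal sketch that only counts classifier evaluations per prediction. It is not Parabel, Slice, or any algorithm from the episode. It assumes a hypothetical balanced binary tree over the labels in which, at each level, both children of every node kept in a small beam are scored. The function names and the `beam_width` parameter are illustrative assumptions.

```python
def one_vs_all_evaluations(num_labels: int) -> int:
    """One-versus-all scores every label once, so cost is linear in the label count."""
    return num_labels


def balanced_tree_evaluations(num_labels: int, beam_width: int = 1) -> int:
    """Upper bound on evaluations when descending a balanced binary label tree.

    Assumption (hypothetical setup): at each level we score both children of
    every node kept in the beam, so the cost is at most 2 * beam_width * depth.
    """
    # ceil(log2(num_labels)) computed with integers to avoid float rounding.
    depth = (num_labels - 1).bit_length() if num_labels > 1 else 0
    return 2 * beam_width * depth


if __name__ == "__main__":
    for n in (100, 1_000_000, 100_000_000):
        print(n, one_vs_all_evaluations(n), balanced_tree_evaluations(n))
```

Under these assumptions, going from 100 to 100 million labels multiplies the one-versus-all count by a million, while the tree count grows only with the depth of the tree. The open question raised in the interview is whether accuracy can be kept at one-versus-all levels while paying only that logarithmic cost.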

Host: So, we’ve talked about going from 100 to 100 million labels. What’s it going to take to go from 100 million to a billion labels?

Manik Varma: Yeah, that’s really exciting, and that’s actually some of the things that we’re exploring right now. In fact, not only do I want to go to a billion labels, I want to go to an infinite set of labels. That would be the next extreme in extreme classification!

Host: Yeah.

Manik Varma: And that’s really important, again, for the real world, because if you think about applications on Bing or on search or recommendation, you have new documents coming on the web all the time, or you have new products coming into a catalog all the time. And currently, our classifiers don’t deal with that. Thankfully, we’ve managed to cut down our training costs so that you can just retrain the entire classifier periodically when you have a batch of new items come in. But if you have new items coming in at a very fast rate, then you need to be able to deal with them from the get-go. And so, we’re now starting to look at problems where there’s no limit on the number of classes that you’re dealing with.

Host: Well, I imagine the theory and the math, and then the engineering to get us to that level is going to be significant as well.

Manik Varma: But it’ll be a lot of fun. You should come and join us! (laughter)

Host: “Come on, it’ll be fun!” he said.

Manik Varma: Yeah.

Host: Manik Varma, it’s been an absolute delight talking to you today. Thanks for coming on the podcast.

Manik Varma: Thank you, Gretchen. As I said, the pleasure was entirely mine.

(music plays)

To learn more about Dr. Manik Varma and how researchers are tackling extreme problems with extreme classification, visit Microsoft.com/research