Return to Blog Home

Microsoft Research Blog

Everything you always wanted to know about extreme classification (but were afraid to ask)


In a world in which search has become ubiquitous, most of us are familiar with the experience of getting back less than satisfactory (also known as completely irrelevant) search results. What would you ask your search engine if your initial query—say, cam procedure shoulder—doesn’t return results you like? Sure, you could go through a list of the top 100 million queries currently being asked to see which ones might be better, a process that would take you three years, assuming it only takes you a single second to read through each query.

Seriously. Who’s got that kind of time to data mine?

Fortunately, you and your sore shoulder can breathe a sigh of relief regarding those three sleepless years. A team from Microsoft decided to take an innovative run at these very real and challenging problems. Indeed, their solution, one that is turning heads at the 12thAnnual ACM International Conference on Web Search and Data Mining – WSDM 2019 in Melbourne – demonstrates that the aforementioned task can be performed in—wait for it—milliseconds.

“Multi-label classification is the science of building algorithms that can answer multiple choice questions involving uncertainty when there could be multiple correct answers,” explained Manik Varma, Principal Researcher at Microsoft Research India, who led the team. Varma likens the problem to ordering a family meal at a restaurant from a menu of choices—you need to determine which subset of dishes from the menu will be the most satisfying within the given budget. This can be a much harder problem to solve than traditional multi-class classification, where one would only have to choose a single dish from the menu. And so, despite decades of research, computer scientists were only able to tackle problems involving a small number of choices; there was no hope of even thinking about query recommendation on a search engine as a multi-label classification task.

Nominations open for 2021 PhD Fellowship

A two-year fellowship for North American PhD students.

Enter extreme classification. “A mere half-decade ago, multi-label algorithms could barely scale to questions involving five thousand choices,” recalls Himanshu Jain, co-author on the WSDM paper and Varma’s PhD student at the Indian Institute of Technology Delhi who interned at Microsoft. That was then. In 2013, Varma’s team published a paper that exploded the number of choices that could be considered from five thousand to ten million. This changed the nature of the game and led to the establishment of a new research area in machine learning called extreme classification. Extreme classification deals with multi-class and multi-label problems involving an extremely large number of choices. Since then, extreme classification has opened a new paradigm for ranking and recommendation applications, such as suggesting related queries on a search engine.

Decisions, decisions. New paradigm, anyone? Image courtesy of

“The whole notion of what constitutes a good answer changes when going from a thousand choices to a million choices” explained Varma. And indeed, many new research questions arise at this scale that the machine learning community had not looked at traditionally. For example, the traditional paradigm in supervised learning is to assume that there exists an expert who can provide the right answers for questions that can be used for evaluating the performance of an algorithm or for hyper-parameter tuning. “Unfortunately, no human expert in the world can go through a list of 10 million choices to mark out all the correct answers,” explained Varma. As a result of this, only partial answers for each question exist at the extreme scale, thereby allowing even the most fundamental machine learning techniques, such as cross-validation, to potentially go awry. These new research questions, coupled with high-impact ranking and recommendation applications, have made extreme classification a hot area of research in both academia and industry. “The extreme classification community has made remarkable progress over the last six years. Papers on extreme classification have been published in diverse conferences including AAAI, AISTATS, ICML, IJCAI, KDD, NIPS, SIGIR, WSDM, WWW and others, improving the prediction accuracy on a benchmark task from 19% in 2013 to 65% today,” observed Varma.

Unfortunately, the technical challenge in extreme classification is to not just increase the prediction accuracy but to also reduce the training time, prediction time and model size. The time required for training the classifiers is one of the biggest issues in extreme classification. “In 2013, training our tree-based extreme classifier, called MLRF, on a Wikipedia-sized dataset took seven hours on a cluster of a thousand cores, which was simply unacceptable at Bing’s scale,” explained Varma. A new algorithm had to be developed.

They decided to call it Slice.

How it works

Slice stands for Scalable LInear extreme ClassifiErs and it can be more than ten thousand times faster to train than MLRF. Ten. Thousand. Times. The algorithm accurately and efficiently scales to problems involving 100 million labels and 240 million training points, blowing the doors off the scaling capabilities of all other classifiers in the world. Slice learns a linear classifier per label but cuts down training and prediction costs from linear to logarithmic in the number of labels. It does so by leveraging the observation that only a small number of labels, say logarithmic, are active in any given region of feature space. Given a test point, it quickly determines which region of feature space it belongs to based on approximate nearest neighbor search. It then evaluates the classifiers for only the labels active in the region thereby incurring just logarithmic prediction costs. During training, given N points in D dimensions with L labels, Slice reduces training costs from O(NDL) to O(ND log L) by training each linear classifier on only O((N log L)/L) of the hardest negative examples rather than all O(N) negative examples. This is achieved via a novel negative sub-sampling technique based on a generative model approximation of the discriminative linear classifier. The technique is particularly effective for low-dimensional deep learning features for which it can efficiently sub-sample a few hundred of the hardest negatives from hundreds of millions of points without losing accuracy.



Worth it?

Oh yes.

Extreme classification opens a new paradigm for thinking about ranking and recommendation problems. By treating each item to be ranked or recommended as a separate label, we can learn an extreme classifier that takes the user’s feature vector as input, predicts the subset of relevant labels as output and then return the items corresponding to the predicted labels to the user either as a recommended bag or as a ranked list, depending on the application. Reformulating the problem this way can, in certain cases, lead to serious improvements in results over traditional collaborative filtering, learning-to-rank or content-based approaches to ranking and recommendation.

Take again the aforementioned task of recommending related queries on the Bing search engine. If you treat each of the top 100 million queries on Bing as a separate label, take the user query as input, and predict the relevant subset of 100 million labels (queries) as output, this is an extreme classification reformulation of your query recommendation problem.

Shaping the problem this way lets Bing make more relevant recommendations with a 12 percent improvement in the success rate for tail queries thereby allowing millions of users to complete their tasks more efficiently.

“The first time I heard Manik’s talk on Extreme Classification, several years ago, it took me a while to understand what he was trying to do. It was an entirely new way to look at multi-label classification and efficiency as well as model size were a few reasons why the approach could not possibly work. Over the years, Manik and his collaborators have made so much progress that it has spawned a sub-area in Machine Learning. Top ML conferences now have workshops on Extreme Classification. I expect this work to have continued scientific impact. This is a great example of how sustained research can make impossible things possible.” – Sriram Rajamani, Managing Director of Microsoft Research India

Going back to the query we opened the article with—cam procedure shoulder (a query about comprehensive arthroscopic management shoulder surgery), Bing in the past returned irrelevant suggestions such as the shoulder surgery of football quarterback Cam Newton. Now, as far as we’re concerned Cam can do no wrong—not after that insane fourth quarter comeback against the Eagles in Week 7. But using Slice, Bing quickly realized that Cam Newton was a negative hit for this search query and returned much more relevant and accurate responses. Indeed, Slice can surpass state-of-the-art query recommendation techniques by at least 10 percent. Furthermore, when flighted as part of Bing’s related searches capability, Slice increased the number of queries for which at least eight recommendations were being made by 52 percent and the number of suggestions per query by 33 percent. Slice has something in common with Cam Newton after all: killer stats.

Looking beyond search and information retrieval, extreme classification can also have applications in computational advertising, recommender systems, computer vision and even natural language processing. For instance, you could use extreme classification to predict which word was going to be typed next on Bing. Or, you could use it to recognize the stranger you bumped into at the conference. Or, you might even use it to recommend which music track you should listen to next. And, of course, extreme classification is already being used by millions of people around the world to more effectively find and purchase the goods and services that they are looking for through the Bing Ads system.

Extreme classification opens new approaches and unique applications anytime you have hundreds of millions of things to classify—which these days is pretty frequently. But it also opens up new vistas in research and begs the question, what other problems, reformulated as extreme classification, might we now have a better shot at solving?

Interested? I should hope so. Seriously, Varma and his team at Microsoft and IIT Delhi encourage anyone to dive into The Extreme Classification Repository containing code, datasets, metrics, benchmark results and publications and they look forward to seeing what you might come up with.

Français English