Microsoft Research Blog

Microsoft Research Blog

The Microsoft Research blog provides in-depth views and perspectives from our researchers, scientists and engineers, plus information about noteworthy events and conferences, scholarships, and fellowships designed for academic and scientific communities.

New perspectives on contextual bandit

July 13, 2018 | By Sébastien Bubeck, Senior Researcher

What is the common denominator between the following situations: a doctor choosing among different drugs for a sequence of patients; a website selecting ads for its visitors based on the information it has about them; and a travel agency making recommendations to its customers for their next vacation? All can be viewed as instances of an abstract mathematical problem known as contextual multi-armed bandit. In the last decade, this problem has been a focus of the interactive machine learning community. It is rare for abstract problems to have immediate applications, but the contextual bandit is both rich mathematically speaking while it also models fundamental problems out in the world in a wide variety of current applications.

Microsoft Research, has been working extensively on both refining the theory of contextual bandit, as well as making the known algorithms scalable to internet-scale applications. The New York City’s team of bandit experts has made several breakthroughs in the field. Particularly striking was when researchers Alekh Agarwal, John Langford, Rob Schapire, Akshay Krishnamurthy, and Haipeng Luo put forth an open problem about an elementary question on contextual bandit at the 2017 Conference On Learning Theory in Amsterdam, The Netherlands. This open problem immediately attracted the attention of fellow conference attendee Sebastien Bubeck, Senior Researcher at Microsoft Research in Redmond, Washington and an expert in multi-armed bandit theory. Upon his return to Redmond, he started to work on the problem with his intern, Yuanzhi Li, a graduate student at Princeton University, and he also enrolled his new colleague Zeyuan Allen-Zhu who had recently joined the Machine Learning and Optimization team in Redmond.

The open problem from the New York City research team can be phrased as follows. In contextual bandit one tries to identify online the best strategy from some set of fixed policies (for example, defined by some set of neural networks). A basic performance measure is the so-called “regret”, that controls how much one regrets not playing the optimal policy. Seminal work from the 1990s gave an algorithm, known as Exp4, which was proved to be unimprovable without restricting in some ways the set of policies. What the 2017 open problem asks is this: knowing a priori that there is a very good policy, can we hope to get a much smaller regret? Intuitively this should be the case, as such a good policy should stand out more rapidly. In the bandit theory community this is known as a first order regret bound and such bounds were known for many bandit settings. Therefore, it was really surprising that the contextual version of the question seemed to require new tools.

In work to appear this month at ICML 2018 in Stockholm, Sweden, Allen-Zhu, Bubeck and Li succeeded in solving the open problem and gave the first contextual bandit algorithm with an optimal first order regret bound. It turned out that the intuition from the New York City researchers was correct and that new tools were required to solve the problem. A major conceptual contribution of the paper is the introduction of auxiliary experts. This idea, while powerful in theory, also has limitations in practice. In particular the Microsoft Research New York City team’s open problem was also asking for a scalable algorithm. This challenge remains open to date but the Redmond researchers hope to tackle it too. One thing is certain, more discoveries in contextual bandit are awaiting!

Up Next


Chasing convex bodies and other random topics with Dr. Sébastien Bubeck

Episode 53, December 5, 2018 - Dr. Sébastien Bubeck explains the difficulty of the multi-armed bandit problem in the context of a parameter- and data-rich online world. He also discusses a host of topics from randomness and convex optimization to metrical task systems and log n competitiveness to the surprising connection between Gaussian kernels and what he calls some of the most beautiful objects in mathematics.

Microsoft blog editor

Unlikely research area reveals surprising twist in non-smooth optimization

Algorithms, Artificial intelligence

Unlikely research area reveals surprising twist in non-smooth optimization

Modern machine learning is characterized by two key features: high-dimensional models and very large datasets. Each of these features presents its own unique challenges, from basic issues such as storing and accessing all of the data to more intricate mathematical quests such as finding good algorithms to search through the high-dimensional space of models. In […]

Sébastien Bubeck

Senior Researcher

Algorithms, Artificial intelligence, Medical, health and genomics, Social sciences

Opportunities abound in the creative and collaborative culture of STEM

The explosion of data available today everywhere from biomedicine to the arts is opening new opportunities for researchers with backgrounds in science, technology, engineering and math to pursue creative and collaborative endeavors that have deep societal impact. My research has always been interdisciplinary, which by nature is collaborative and creative. You need experts from different […]

Jennifer Chayes

Technical Fellow & Managing Director, Microsoft Research New England, New York City and Montreal