New perspectives on contextual bandit
What is the common denominator between the following situations: a doctor choosing among different drugs for a sequence of patients; a website selecting ads for its visitors based on the information it has about them; and a travel agency making recommendations to its customers for their next vacation? All can be viewed as instances of an abstract mathematical problem known as the contextual multi-armed bandit. Over the last decade, this problem has been a focus of the interactive machine learning community. It is rare for an abstract problem to have such immediate applications, but the contextual bandit is both mathematically rich and a faithful model of fundamental problems arising in a wide variety of current applications.
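To make the abstraction concrete, the interaction can be sketched as a simple loop: at each round the learner observes a context, picks one of K actions, and sees the reward of only that action. The toy environment and the names below (`choose_action`, `get_reward`) are illustrative inventions, not part of any real system:

```python
import random

random.seed(0)

K = 3  # number of actions (e.g. drugs, ads, vacation recommendations)

def choose_action(context):
    # Placeholder learner: picks an action uniformly at random.
    # A real contextual bandit algorithm would use the context and
    # past feedback to do much better.
    return random.randrange(K)

def get_reward(context, action):
    # Hypothetical environment: the best action depends on the context.
    return 1.0 if action == context % K else 0.0

total_reward = 0.0
for t in range(1000):
    context = random.randrange(10)        # observe a context (e.g. patient features)
    action = choose_action(context)       # commit to one action
    reward = get_reward(context, action)  # reward is revealed ONLY for the chosen action
    total_reward += reward

print(total_reward)
```

The defining difficulty is visible in the loop: the learner never finds out what the other K−1 actions would have paid, so it must balance exploring them against exploiting what it already knows.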
Microsoft Research has been working extensively both on refining the theory of contextual bandits and on making the known algorithms scale to internet-sized applications. The New York City team of bandit experts has made several breakthroughs in the field. A particularly striking moment came when researchers Alekh Agarwal, John Langford, Rob Schapire, Akshay Krishnamurthy, and Haipeng Luo posed an open problem about an elementary question on contextual bandits at the 2017 Conference on Learning Theory (COLT) in Amsterdam, The Netherlands. The problem immediately attracted the attention of fellow conference attendee Sebastien Bubeck, Senior Researcher at Microsoft Research in Redmond, Washington, and an expert in multi-armed bandit theory. Upon his return to Redmond, he started working on the problem with his intern, Yuanzhi Li, a graduate student at Princeton University, and he also enlisted his new colleague Zeyuan Allen-Zhu, who had recently joined the Machine Learning and Optimization team in Redmond.
The open problem from the New York City research team can be phrased as follows. In contextual bandits, one tries to identify online the best strategy from some fixed set of policies (for example, policies defined by a set of neural networks). A basic performance measure is the so-called “regret”, which quantifies how much one regrets not having played the optimal policy. Seminal work from the 1990s gave an algorithm, known as Exp4, that was proved to be unimprovable without restricting the set of policies in some way. The 2017 open problem asks: knowing a priori that there is a very good policy, can we hope to achieve much smaller regret? Intuitively this should be the case, as such a good policy should stand out quickly. In the bandit theory community this is known as a first-order regret bound, and such bounds were already known for many bandit settings. It was therefore surprising that the contextual version of the question seemed to require new tools.
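As a rough illustration of how Exp4 works, here is a simplified sketch of its exponential-weights update (a gains-only variant without the extra uniform exploration of the full algorithm; the toy experts and environment are made up for this example). Each expert maps a context to a distribution over actions; the learner plays a weighted mixture of the experts' advice and rewards experts via an importance-weighted estimate:

```python
import math
import random

random.seed(1)

K = 2        # actions
T = 2000     # rounds
eta = 0.05   # learning rate

# Toy experts: each expert maps a context to a probability distribution
# over the K actions. The first matches the environment below; the
# second always plays action 0.
def expert_good(context):
    probs = [0.0] * K
    probs[context % K] = 1.0
    return probs

def expert_constant(context):
    return [1.0] + [0.0] * (K - 1)

experts = [expert_good, expert_constant]
weights = [1.0] * len(experts)

total_reward = 0.0
for t in range(T):
    context = random.randrange(10)
    advice = [e(context) for e in experts]

    # Mix the experts' advice according to their current weights.
    W = sum(weights)
    p = [sum(weights[i] * advice[i][a] for i in range(len(experts))) / W
         for a in range(K)]

    # Sample an action; the reward is observed only for that action.
    a = random.choices(range(K), weights=p)[0]
    reward = 1.0 if a == context % K else 0.0
    total_reward += reward

    # Importance-weighted reward estimate for the chosen action, then
    # an exponential-weights update for each expert.
    rhat = reward / p[a]
    for i in range(len(experts)):
        weights[i] *= math.exp(eta * advice[i][a] * rhat)

print(total_reward, weights)
```

After enough rounds the good expert's weight dominates, so the mixture concentrates on its advice; the regret guarantee for the full algorithm makes this precise.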
In work to appear this month at ICML 2018 in Stockholm, Sweden, Allen-Zhu, Bubeck, and Li succeeded in solving the open problem, giving the first contextual bandit algorithm with an optimal first-order regret bound. It turned out that the New York City researchers’ intuition was correct: new tools were indeed required. A major conceptual contribution of the paper is the introduction of auxiliary experts. This idea, while powerful in theory, also has limitations in practice; in particular, the Microsoft Research New York City team’s open problem also asked for a scalable algorithm. That challenge remains open to date, but the Redmond researchers hope to tackle it too. One thing is certain: more discoveries in contextual bandits await!
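In symbols, and suppressing constants and logarithmic factors, the improvement can be stated roughly as follows. This is the standard form in which such bounds are written in the bandit literature, given here as a sketch rather than a quotation from the paper:

```latex
% Regret of an algorithm playing actions a_t on contexts x_t,
% against the best of N fixed policies \pi, over T rounds:
R_T \;=\; \max_{\pi} \sum_{t=1}^{T} r_t\big(\pi(x_t)\big) \;-\; \sum_{t=1}^{T} r_t(a_t)

% Worst-case guarantee of Exp4 (K actions, N policies):
\mathbb{E}[R_T] \;=\; O\!\left(\sqrt{T K \log N}\right)

% A first-order bound replaces T by L^*, the cumulative loss of the
% best policy, so an a priori very good policy (small L^*) forces
% the regret to be correspondingly small:
\mathbb{E}[R_T] \;=\; \widetilde{O}\!\left(\sqrt{L^* K \log N}\right)
```

The second bound is never worse than the first, since L^* is at most of order T, and it is dramatically better exactly in the regime the open problem asked about.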