Exploring the fundamentals of multi-armed bandits


If you’ve ever been in a casino, you may have found yourself asking one very pertinent question: On which slot machine am I going to hit the jackpot? Standing in front of a bank of identical-looking machines, you have only instinct to go on. It isn’t until you start putting your money into these one-armed bandits, as they’re also known, that you get a sense of which are hot and which are not, and when you find one that’s paying out regularly, you might stick with it in hopes of winning big. Though seemingly specific to the Las Vegas Strip, this scenario boils down to an exploration-exploitation tradeoff: make a decision based on what you already know and potentially miss out on a bigger reward, or spend time and resources continuing to gather information.

In machine learning and operations research, this tradeoff is captured by multi-armed bandits, a simple but very powerful framework for algorithms that take actions and learn over time under uncertain conditions. The framework makes the exploration-exploitation tradeoff more tractable and is readily extendable to a variety of more complex scenarios. It plays a crucial role in a variety of application domains, including content personalization, search engines, e-commerce, and cloud systems.
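To make the tradeoff concrete, here is a minimal sketch of one of the simplest bandit strategies, epsilon-greedy (not taken from the book; the arm payout probabilities and parameter values below are illustrative assumptions). With probability epsilon the algorithm explores a random arm; otherwise it exploits the arm with the best observed average reward:

```python
import random

def epsilon_greedy(reward_probs, epsilon=0.1, rounds=10000, seed=0):
    """Epsilon-greedy bandit: explore a random arm with probability epsilon,
    otherwise exploit the arm with the highest observed average reward."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    pulls = [0] * n_arms      # how many times each arm was played
    totals = [0.0] * n_arms   # cumulative reward collected per arm

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random arm
        else:
            # Exploit: pick the best empirical average; unplayed arms
            # get +inf so each arm is tried at least once.
            averages = [totals[i] / pulls[i] if pulls[i] else float("inf")
                        for i in range(n_arms)]
            arm = averages.index(max(averages))
        # Bernoulli reward: arm i pays 1 with its hidden probability.
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward

    return pulls, totals

# Three "slot machines" with hidden payout probabilities 0.2, 0.5, 0.7.
pulls, totals = epsilon_greedy([0.2, 0.5, 0.7])
best_arm = pulls.index(max(pulls))
```

Even this crude rule eventually concentrates most plays on the best arm, while the constant exploration rate means it keeps paying a small price forever; the algorithms studied in the book improve on exactly this kind of behavior.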


I’m pleased to announce the release of Introduction to Multi-Armed Bandits, an introductory text drawing on 13-plus years of experience in the area and based largely on a graduate-level course I taught at the University of Maryland. I designed the book to provide easy entry into the subject matter and a strong foundational understanding with which to explore more advanced work in the area.

The book—a breakdown

Multi-armed bandits have been continuously studied since William Thompson’s seminal paper on the subject was published in 1933 and have experienced a surge in interest over the past two decades. The resulting body of work on bandits is huge and, at times, technically complicated. In the book, I focus on the fundamentals of the most important directions the field has taken. Each of the 11 chapters explores a different direction, such as stochastic bandits, lower bounds, adversarial bandits, and contextual bandits. The last three chapters address economic aspects of the problem space—essentially, how bandit algorithms interact with self-interested parties that pursue their own incentives—which I find particularly interesting. These self-interested parties can, for example, be buyers in a market, advertisers in an ad exchange, users in a recommendation system, or other bandit algorithms.
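As a small taste of the stochastic-bandits direction, here is a sketch of the classic UCB1 rule (a standard algorithm in the literature, not code from the book; the payout probabilities below are illustrative assumptions). Instead of exploring at random, it plays the arm maximizing its average reward plus a confidence bonus, so uncertain arms get explored automatically:

```python
import math
import random

def ucb1(reward_probs, rounds=10000, seed=0):
    """UCB1: play the arm maximizing empirical mean + sqrt(2 ln t / n_i).
    The bonus shrinks as an arm is played more, balancing exploration
    and exploitation without any tuning parameter."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    pulls = [0] * n_arms
    totals = [0.0] * n_arms

    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: play each arm once
        else:
            scores = [totals[i] / pulls[i]
                      + math.sqrt(2 * math.log(t) / pulls[i])
                      for i in range(n_arms)]
            arm = scores.index(max(scores))
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward

    return pulls

pulls = ucb1([0.2, 0.5, 0.7])
```

The confidence-bonus idea behind this rule is exactly the kind of fundamental technique the stochastic-bandits chapters develop, along with the regret guarantees that justify it.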

Each chapter gives a technical introduction to the respective direction, roughly corresponding to a three-hour lecture, followed by a literature review and, in most cases, some exercises. The literature review provides perspective and pointers for further reading.

The multi-armed bandit problem is both deeply theoretical and deeply practical. More often than not, real-world scenarios are complex, encompassing many aspects and factors. If we try to solve everything right away, then we probably won’t solve anything at all. Theory allows us to divide and conquer—to understand the different aspects in isolation and then build up toward realistic solutions. Throughout the book, I seek to capture the importance of both theory and practical applications and how they relate.

Introduction to Multi-Armed Bandits is great for beginning graduate students; instructors looking for a course textbook or lecture material; and researchers who want to familiarize themselves with the subject. The book, published with Foundations and Trends in Machine Learning, is available in softcover or as an e-book. It’s also available in plain format on arXiv.org. Feedback is welcome, as I plan to update the text in the future.

Bandits at Microsoft

Multi-armed bandits is a very active research area at Microsoft, both academically and practically. A company project on large-scale applications of bandits has undergone many successful deployments and is currently available as an open-source library and a service on Microsoft Azure.

My book complements multiple books and surveys on the subject that have been published over the years. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems by my Microsoft colleague and collaborator Sébastien Bubeck and his co-author Nicolò Cesa-Bianchi has been a standard reference since 2012.

Alex Slivkins is a Principal Researcher at Microsoft Research New York City. His research interests are in algorithms and theoretical computer science, spanning machine learning theory, algorithmic economics, and networks. He is particularly interested in online machine learning and the exploration-exploitation tradeoff, and their manifestations in socioeconomic environments.