Multi-Armed Bandits

This is an umbrella project for several related efforts at Microsoft Research Silicon Valley that address various Multi-Armed Bandit (MAB) formulations motivated by web search and ad placement. The MAB problem is a classical paradigm in Machine Learning in which an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies.

This page has been inactive since the closure of MSR-SVC in September 2014.

The name “multi-armed bandits” comes from a whimsical scenario in which a gambler faces several slot machines, a.k.a. “one-armed bandits”, that look identical at first but produce different expected winnings. The crucial issue here is the trade-off between acquiring new information (exploration) and capitalizing on the information available so far (exploitation). While MAB problems have been studied extensively in Machine Learning, Operations Research and Economics, many exciting questions remain open. One aspect that we are particularly interested in is how to model and efficiently use the various types of side information that may be available to the algorithm.
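
As an illustration of the exploration-exploitation trade-off, here is a minimal sketch of the standard UCB1 index rule on simulated slot machines. It is not code from this project: the function name `ucb1`, the Bernoulli payoff model, and the parameters `arm_means`, `horizon` and `seed` are assumptions made for the example.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms.

    arm_means: hypothetical success probabilities, unknown to the algorithm.
    Returns (number of pulls per arm, total payoff collected).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k    # how many times each arm was played
    sums = [0.0] * k    # total payoff observed from each arm
    total = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Initial exploration: play every arm once.
            arm = t - 1
        else:
            # Exploitation term (empirical mean) plus an exploration
            # bonus that shrinks as an arm is played more often.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward

    return counts, total


if __name__ == "__main__":
    pulls, payoff = ucb1([0.1, 0.9], horizon=2000, seed=1)
    print("pulls per arm:", pulls, "total payoff:", payoff)
```

The bonus term keeps rarely played arms in contention (exploration), while the empirical mean steers most plays toward the arm that currently looks best (exploitation).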

Contact: Alex Slivkins.

Research directions

  • MAB with similarity information
  • MAB in a changing environment
  • Explore-exploit tradeoff in mechanism design
  • Explore-exploit learning with limited resources
  • Risk vs. reward tradeoff in MAB

External visitors and collaborators

Prof. Sébastien Bubeck (Princeton)
Prof. Robert Kleinberg (Cornell)
Filip Radlinski (MSR Cambridge)
Prof. Eli Upfal (Brown)

Former interns
Yogi Sharma (Cornell -> Facebook; intern at MSR-SV in summer 2008)
Umar Syed (Princeton -> Google; intern at MSR-SV in summer 2008)
Shaddin Dughmi (Stanford -> USC; intern at MSR-SV in summer 2010)
Ashwinkumar Badanidiyuru (Cornell -> Google; intern at MSR-SV in summer 2012)

MAB problems with similarity information

  • Multi-armed bandits in metric spaces
    Robert Kleinberg, Alex Slivkins and Eli Upfal (STOC 2008)
    Abstract We introduce a version of the stochastic MAB problem, possibly with a very large set of arms, in which the expected payoffs obey a Lipschitz condition with respect to a given metric space. The goal is to minimize regret as a function of time, both in the worst case and for ‘nice’ problem instances.
  • Sharp dichotomies for regret minimization in metric spaces
    Robert Kleinberg and Alex Slivkins (SODA 2010)
    Abstract We focus on the connections between online learning and metric topology. The main result is that the worst-case regret is either O(log t) or at least √t, depending on whether the completion of the metric space is compact and countable. We prove a number of other dichotomy-style results, and extend them to the full-feedback (experts) version.
  • Learning optimally diverse rankings over large document collections
    Alex Slivkins, Filip Radlinski and Sreenivas Gollapudi (ICML 2010)
    Abstract We present a learning-to-rank framework for web search that incorporates similarity and correlation between documents and thus, unlike prior work, scales to large document collections.
  • Contextual bandits with similarity information
    Alex Slivkins (COLT 2011)
    Abstract In the ‘contextual bandits’ setting, in each round nature reveals a ‘context’ x, the algorithm chooses an ‘arm’ y, and the expected payoff is µ(x,y). Similarity information is expressed by a metric space over the (x,y) pairs such that µ is a Lipschitz function. Our algorithms are based on adaptive (rather than uniform) partitions of the metric space which are adjusted to the popular and high-payoff regions.
  • Multi-armed bandits on implicit metric spaces
    Alex Slivkins (NIPS 2011)
    Abstract Suppose an MAB algorithm is given a tree-based classification of arms. This tree implicitly defines a “similarity distance” between arms, but the numeric distances are not revealed to the algorithm. Our algorithm (almost) matches the best known guarantees for the setting (Lipschitz MAB) in which the distances are revealed.
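
For reference, the Lipschitz condition and the notion of regret mentioned in the abstracts above can be written out as follows. This is a sketch in assumed notation: the symbols D (the metric), µ* (the best expected payoff), x_s (the arm chosen in round s) and R(t) are not taken from the papers, and the regret shown is the standard definition for stochastic MAB.

```latex
% Lipschitz MAB: expected payoffs of nearby arms are close.
|\mu(x) - \mu(y)| \le D(x, y) \quad \text{for all arms } x, y.

% Contextual version: the metric is over (context, arm) pairs.
|\mu(x, y) - \mu(x', y')| \le D\big((x, y), (x', y')\big).

% Regret after t rounds, relative to always playing a best arm.
R(t) = t \cdot \mu^{*} - \mathbb{E}\Big[\sum_{s=1}^{t} \mu(x_s)\Big],
\qquad \mu^{*} = \sup_{x} \mu(x).
```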

MAB problems in a changing environment

Explore-exploit tradeoff in mechanism design

Explore-exploit learning with limited resources

  • Dynamic pricing with limited supply
    Moshe Babaioff, Shaddin Dughmi, Robert Kleinberg and Alex Slivkins (EC 2012)
    Abstract We consider dynamic pricing with limited supply and unknown demand distribution. We extend MAB techniques to the limited supply setting, and obtain optimal regret rates.
  • Bandits with Knapsacks
    Ashwinkumar Badanidiyuru, Robert Kleinberg and Alex Slivkins (FOCS 2013)
    Abstract We define a broad class of explore-exploit problems with knapsack-style resource constraints, which subsumes dynamic pricing, dynamic procurement, pay-per-click ad allocation, and many other problems. Our algorithms achieve optimal regret w.r.t. the optimal dynamic policy.
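
To make the limited-supply setting concrete, here is a minimal simulation sketch of the problem (not of the algorithms in the papers above). The function name `simulate_fixed_price`, its parameters, and the uniform valuation model are assumptions for illustration only: a seller offers one fixed price to agents arriving in sequence and must stop once the supply is exhausted.

```python
import random


def simulate_fixed_price(price, num_agents, supply, seed=0):
    """Offer the same price to each arriving agent until supply runs out.

    Returns (total revenue, number of items sold).
    """
    rng = random.Random(seed)
    revenue = 0.0
    remaining = supply

    for _ in range(num_agents):
        if remaining == 0:
            break  # resource constraint: no items left to sell
        # ASSUMPTION: each agent's private value is uniform on [0, 1].
        value = rng.random()
        if value >= price:
            revenue += price
            remaining -= 1

    return revenue, supply - remaining


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        print(p, simulate_fixed_price(p, num_agents=1000, supply=50))
```

In this toy model a low price may sell out early at little revenue per item, while a high price may leave items unsold; a learning algorithm has to find a good price without knowing the demand distribution in advance.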

Risk vs. reward trade-off in MAB