Multi-armed bandits are the most basic model for online learning with interaction: in each round, a learner takes an action, and in return receives a numerical reward. The goal of the learner is optimize her action-selection policy to maximize the total reward received. Very often in practice, the learner has access to contextual information in each round to infer which action leads to the highest reward. This kind of bandit problems, known as contextual bandits, turn out to be able to capture a number of important Web applications such as news recommendation, targeted advertising, ranking, etc. The key challenge here is the exploration/exploitation tradeoff, especially when contextual information is used. My work has resulted in principled and effective algorithms with parametric models such as linear or generalized linear models, as well as expert-style algorithms for adversarial scenarios.
Sample complexity of exploration in reinforcement learning studies the fundamental question of how fast a reinforcement learner can approach a near-optimal policy by interacting with an initially unknown environment. The key challenge again is the exploration/exploitation tradeoff, and is harder to solve than in the simpler bandit setting. Thanks to the Knows What It Knows framework for prediction-error-aware supervised learning, we developed a meta-algorithm that provably enjoys polynomial sample complexity in a wide range of reinforcement learning problems. More details are found in the (shorter) survey and the (longer) dissertation.
Offline evaluation of learning algorithms aims at providing performance estimates reliably based on log data, without deploying the algorithms in the real system (for cost/risk reasons). In interactive problems, offline evaluation (sometimes known as off-policy reinforcement learning) becomes much harder because of counterfactual effects. From real data analysis, we showed offline evaluation can be done reliably for multi-armed bandits. Extensions are possible with help of statistical techniques such as importance sampling, rejection sampling, and doubly robust estimation. I also helped release benchmark datasets for bandit algorithms while with Yahoo!, one of which was used in a PASCAL2 Challenge.
Online learning provides a viable solution to large-scale machine learning. Truncated gradient is one such example that solves the celebrated Lasso approximately and efficiently in a stochastic-gradient-descent fashion. Another example is propensity score estimation, a critical step for answering what-if questions in counterfactual analysis. Online learning can also be combined with parallel computing to yield even greater speedup. I also helped develop the Vowpal Wabbit software that has found many uses in industry.
Online recommendation of contents (like ads, news, friends, music) is ubiquitous on the Web, and can be naturally modeled as a multi-armed bandit where the reward can be clicks, revenue, etc. I am more interested in personalized online recommendation which takes advantage of contextual information of users/queries to produce recommendations of greater quality. Algorithms that we developed for multi-armed bandits have shown great benefits in problems like news recommendation and targeted adverting.
Learning to rank from user clicks can sometimes be very useful for optimizing a ranking function based on what users want, together with judgments of human labelers. One example is recency search for time-sensitive queries, where both relevance and freshness have to be taken into account.
- Area Chair for ICML (2012-2015) and NIPS (2014).
- Senior Program Committee Member for IJCAI’2011.
- Reviewer for AAAI, AISTATS, ALT, COLT, ECML, KDD, ICML, IJCAI, NIPS, STOC, UAI, UbiComp, WSDM, WWW.
- Regular reviewing services for Journal of Machine Learning Research, Machine Learning, Journal of Artificial Intelligence Research, and Artificial Intelligence, among other journals and transactions.