Portrait of Lihong Li

Lihong Li

Senior Researcher


I am a member of the Deep Learning Technology Center at Microsoft Research.

My research focuses on machine learning in interactive problems, where the objective is not merely to make accurate predictions, but to optimize a certain reward function by taking good actions.  An example is online advertising, whose ultimate goal is to decide which ad to show to maximize revenue over time.  These problems are known as reinforcement learning in general, and multi-armed bandits in simplified settings.


Multiworld Testing

Established: November 1, 2013

Exponentially better than A/B testing. Multiworld Testing (MWT) is the capability to test and optimize over K policies (context-based decision rules) using an amount of data and computation that scales logarithmically in K, without…

Explore-Exploit Learning @MSR-NYC

Established: October 24, 2013

This is an umbrella project for machine learning with explore-exploit tradeoff: the trade-off between acquiring and using information. This is a mature, yet very active, research area studied in Machine Learning, Theoretical Computer Science, Operations Research, and Economics. Much of…

















Link description

Big Data and Machine Learning


May 22, 2014


Jonathan Huang, Lihong Li, and Mike Zyskowski


Microsoft, Stanford University


Basic Research

Multi-armed bandits are the most basic model for online learning with interaction: in each round, a learner takes an action, and in return receives a numerical reward.  The goal of the learner is optimize her action-selection policy to maximize the total reward received.  Very often in practice, the learner has access to contextual information in each round to infer which action leads to the highest reward.  This kind of bandit problems, known as contextual bandits, turn out to be able to capture a number of important Web applications such as news recommendation, targeted advertising, ranking, etc.  The key challenge here is the exploration/exploitation tradeoff, especially when contextual information is used.  My work has resulted in principled and effective algorithms with parametric models such as linear or generalized linear models, as well as expert-style algorithms for adversarial scenarios.

Sample complexity of exploration in reinforcement learning studies the fundamental question of how fast a reinforcement learner can approach a near-optimal policy by interacting with an initially unknown environment.  The key challenge again is the exploration/exploitation tradeoff, and is harder to solve than in the simpler bandit setting.  Thanks to the Knows What It Knows framework for prediction-error-aware supervised learning, we developed a meta-algorithm that provably enjoys polynomial sample complexity in a wide range of reinforcement learning problems.  More details are found in the (shorter) survey and the (longer) dissertation.

Offline evaluation of learning algorithms aims at providing performance estimates reliably based on log data, without deploying the algorithms in the real system (for cost/risk reasons).  In interactive problems, offline evaluation (sometimes known as off-policy reinforcement learning) becomes much harder because of counterfactual effects.  From real data analysis, we showed offline evaluation can be done reliably for multi-armed bandits.  Extensions are possible with help of statistical techniques such as importance sampling, rejection sampling, and doubly robust estimation.  I also helped release benchmark datasets for bandit algorithms while with Yahoo!, one of which was used in a PASCAL2 Challenge.

Online learning provides a viable solution to large-scale machine learning.  Truncated gradient is one such example that solves the celebrated Lasso approximately and efficiently in a stochastic-gradient-descent fashion.  Another example is propensity score estimation, a critical step for answering what-if questions in counterfactual analysis.  Online learning can also be combined with parallel computing to yield even greater speedup.  I also helped develop the Vowpal Wabbit software that has found many uses in industry.

Applied Research

Online recommendation of contents (like ads, news, friends, music) is ubiquitous on the Web, and can be naturally modeled as a multi-armed bandit where the reward can be clicks, revenue, etc.  I am more interested in personalized online recommendation which takes advantage of contextual information of users/queries to produce recommendations of greater quality.  Algorithms that we developed for multi-armed bandits have shown great benefits in problems like news recommendation and targeted adverting.

Learning to rank from user clicks can sometimes be very useful for optimizing a ranking function based on what users want, together with judgments of human labelers.  One example is recency search for time-sensitive queries, where both relevance and freshness have to be taken into account.

Spoken dialog management is one application area of reinforcement learning, which can be used to optimize the policy of a conversational system so that it can better interact with, and assist, humans.  I have worked on fast feature selection in the past, and am interested in other aspects such as imitation learning at present.

Professional Activities

  • Area Chair for ICML (2012-2015) and NIPS (2014).
  • Senior Program Committee Member for IJCAI’2011.
  • Regular reviewing services for Journal of Machine Learning Research, Machine Learning, Journal of Artificial Intelligence Research, and Artificial Intelligence, among other journals and transactions.Locations of visitors to this page