# NeurIPS 2022: Seven Microsoft Research Papers Selected for Oral Presentations

Published

Microsoft is proud to be a platinum sponsor of the 36th annual conference on Neural Information Processing Systems (NeurIPS), which is widely regarded as the world’s most prestigious research conference on artificial intelligence and machine learning.

Microsoft has a strong presence at NeurIPS again this year, with more than 150 of our researchers participating in the conference and 122 of our research papers accepted. Our researchers are also taking part in 10 workshops, four competitions and a tutorial.

In one of the workshops, AI for Science: Progress and Promises, a panel of leading researchers will discuss how artificial intelligence and machine learning have the potential to advance scientific discovery. The panel will include two Microsoft researchers: Max Welling, Vice President and Distinguished Scientist, Microsoft Research AI4Science, who will serve as moderator, and Peter Lee, Corporate Vice President, Microsoft Research and Incubations.

Of the 122 Microsoft research papers accepted for the conference, seven have been selected for oral presentations during the virtual NeurIPS experience the week of December 4th. The oral presentations provide a deeper dive into each of the featured research topics.

In addition, two other Microsoft research papers received Outstanding Paper Awards for NeurIPS 2022. One of those papers, Gradient Estimation with Discrete Stein Operators, explains how researchers developed a gradient estimator that achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations, which has the potential to improve problem solving in machine learning. In the other paper, A Neural Corpus Indexer for Document Retrieval, researchers demonstrate that an end-to-end deep neural network that unifies training and indexing stages can significantly improve the recall performance of traditional document retrieval methods.

Below we have provided the titles, authors and abstracts for all seven of the Microsoft research papers chosen for oral presentations at NeurIPS, with links to additional information for those who want to explore the topics more fully:

### Uni[MASK]: Unified Inference in Sequential Decision Problems

Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan, Sam Devlin

### K-LITE: Learning Transferable Visual Models with External Knowledge

Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao

Abstract: The new generation of state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, based on the broad concept coverage achieved through large-scale data collection process. Alternatively, we argue that learning with external knowledge about images is a promising way which leverages a much more structured source of supervision and offers sample efficiency.

In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts; In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Our code is released at https://github.com/microsoft/klite.

### Extreme Compression for Pre-trained Transformers Made Simple and Efficient

Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He

Abstract: Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constraint devices. However, to preserve the accuracy for such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning. Also, they oftentimes focus less on smaller transformer models that have already been heavily compressed via knowledge distillation and lack a systematic study to show the effectiveness of their methods.

In this paper, we perform a very comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous. As a result, we find out that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression.

Our simplified pipeline demonstrates that:

(1) we can skip the pre-training knowledge distillation to obtain a 5-layer \bert while achieving better performance than previous state-of-the-art methods, like TinyBERT;

(2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.

### On the Complexity of Adversarial Decision Making

Dylan J Foster, Alexander Rakhlin, Ayush Sekhari, Karthik Sridharan

Abstract: A central problem in online learning and decision making—from bandits to reinforcement learning—is to understand what modeling assumptions lead to sample-efficient learning guarantees. We consider a general adversarial decision-making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show—via new upper and lower bounds—that the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. in the stochastic counterpart to our setting, is necessary and sufficient to obtain low regret for adversarial decision making. However, compared to the stochastic setting, one must apply the Decision-Estimation Coefficient to the convex hull of the class of models (or, hypotheses) under consideration. This establishes that the price of accommodating adversarial rewards or dynamics is governed by the behavior of the model class under convexification, and recovers a number of existing results –both positive and negative. En route to obtaining these guarantees, we provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures, including the Information Ratio of Russo and Van Roy and the Exploration-by-Optimization objective of Lattimore and György.

### Maximum Class Separation as Inductive Bias in One Matrix

Tejaswi Kasarla, Gertjan J. Burghouts, Max van Spengler, Elise van der Pol, Rita Cucchiara, Pascal Mettes

Abstract: Maximizing the separation between classes constitutes a well-known inductive bias in machine learning and a pillar of many traditional algorithms. By default, deep networks are not equipped with this inductive bias and therefore many alternative solutions have been proposed through differential optimization. Current approaches tend to optimize classification and separation jointly: aligning inputs with class vectors and separating class vectors angularly.

This paper proposes a simple alternative: encoding maximum separation as an inductive bias in the network by adding one fixed matrix multiplication before computing the softmax activations. The main observation behind our approach is that separation does not require optimization but can be solved in closed-form prior to training and plugged into a network. We outline a recursive approach to obtain the matrix consisting of maximally separable vectors for any number of classes, which can be added with negligible engineering effort and computational overhead. Despite its simple nature, this one matrix multiplication provides real impact. We show that our proposal directly boosts classification, long-tailed recognition, out-of-distribution detection, and open-set recognition, from CIFAR to ImageNet. We find empirically that maximum separation works best as a fixed bias; making the matrix learnable adds nothing to the performance. The closed-form implementation and code to reproduce the experiments are available on GitHub.

### Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis

Tim Pearce, Jong-Hyeon Jeong, Yichen Jia, Jun Zhu

Abstract: This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterization of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimization of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimizes a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximization, and secondly that it exhibits a desirable `self-correcting’ property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.

### Learning (Very) Simple Generative Models Is Hard

Sitan Chen, Jerry Li, Yuanzhi Li

Abstract: Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem. For an unknown neural network $$F:\mathbb{R}^d\to\mathbb{R}^{d’}$$, let $$D$$ be the distribution over $$\mathbb{R}^{d’}$$ given by pushing the standard Gaussian $$\mathcal{N}(0,\textrm{Id}_d)$$ through $$F$$. Given i.i.d. samples from $$D$$, the goal is to output $${any}$$ distribution close to $$D$$ in statistical distance.

We show under the statistical query (SQ) model that no polynomial-time algorithm can solve this problem even when the output coordinates of $$F$$ are one-hidden-layer ReLU networks with $$\log(d)$$ neurons. Previously, the best lower bounds for this problem simply followed from lower bounds for $$supervised$$ $$learning$$ and required at least two hidden layers and $$poly(d)$$ neurons [Daniely-Vardi ’21, Chen-Gollakota-Klivans-Meka ’22].

The key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function $$f$$ with polynomially-bounded slopes such that the pushforward of $$\mathcal{N}(0,1)$$ under $$f$$ matches all low-degree moments of $$\mathcal{N}(0,1)$$.