Microsoft Research Blog

Artificial intelligence

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

November 2, 2021

We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of…
Constrained Language Models Yield Few-Shot Semantic Parsers

November 1, 2021

We explore the use of large pretrained language models as few-shot semantic parsers. The goal in semantic parsing is to generate a structured meaning representation given a natural language input. However, language models are trained to generate natural language. To bridge the gap, we use…
The emergence of the shape bias results from communicative efficiency

November 1, 2021

By the age of two, children tend to assume that new word categories are based on objects’ shape, rather than their color or texture; this assumption is called the shape bias. They are thought to learn this bias by observing that their caregiver’s language is…
An Information-theoretic Approach to Distribution Shifts

November 1, 2021 | Marco Federici, Ryota Tomioka, and Patrick Forré

Safely deploying machine learning models to the real world is often a challenging process. Models trained with data obtained from a specific geographic location tend to fail when queried with data obtained elsewhere, agents trained in a simulation can struggle to adapt when deployed in…
How Powerful is Graph Convolution for Recommendation?

October 26, 2021

Graph convolutional networks (GCNs) have recently enabled a popular class of algorithms for collaborative filtering (CF). Nevertheless, the theoretical underpinnings of their empirical successes remain elusive. In this paper, we endeavor to obtain a better understanding of GCN-based CF methods via the lens of graph…
ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition

October 11, 2021

Object recognition has made great advances in the last decade, but predominately still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning…
MetricGAN-U: Unsupervised speech enhancement/dereverberation based only on noisy/reverberated speech

October 11, 2021

Most deep learning-based speech enhancement models are trained in a supervised manner, which means that pairs of noisy and clean speech are required during training. Consequently, much of the noisy speech recorded in daily life cannot be used to train the model. Although certain unsupervised…
Taming Sparsely Activated Transformer with Stochastic Experts

October 7, 2021

Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can easily scale to have outrageously large amounts of parameters without significant increase in computational cost. However, SAMs are reported to be parameter inefficient such that larger models do not always lead to better performance. While most…
Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model

October 1, 2021

Understanding protein sequences is vital and urgent for biology, healthcare, and medicine. Labeling approaches are expensive and time-consuming, while the amount of unlabeled data is increasing much faster than that of labeled data due to low-cost, high-throughput sequencing methods. In order to extract knowledge…
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

October 1, 2021

This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The…
Multi-modal Program Inference: a Marriage of Pre-trained Language Models and Component-based Synthesis

October 1, 2021

Multi-modal program synthesis refers to the task of synthesizing programs (code) from their specification given in different forms, such as a combination of natural language and examples. Examples provide a precise but incomplete specification, and natural language provides an ambiguous but more “complete” task description.…
Deep High-Resolution Representation Learning for Visual Recognition

September 30, 2021

High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet,…