Microsoft Research Blog

Artificial intelligence

  1. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts 

    November 2, 2021

    We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of…
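From this teaser alone, the key structural idea — each block pairs a shared self-attention layer with a pool of modality-specific feed-forward experts — might be sketched as below. This is a minimal illustration, not the paper's implementation: the class name `MoMEBlock`, the expert keys (`"vision"`, `"language"`, `"vl"`), and all dimensions are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoMEBlock:
    """Sketch of a Mixture-of-Modality-Experts block: one self-attention
    layer shared across modalities, plus one feed-forward expert per
    modality (keys are illustrative, not the paper's naming)."""

    def __init__(self, dim, experts=("vision", "language", "vl"), seed=0):
        rng = np.random.default_rng(seed)
        # shared self-attention projections
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        # a pool of modality-specific FFN experts sharing the attention above
        self.experts = {
            m: (rng.standard_normal((dim, 4 * dim)) / np.sqrt(dim),
                rng.standard_normal((4 * dim, dim)) / np.sqrt(4 * dim))
            for m in experts
        }

    def __call__(self, x, modality):
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v
        h = x + attn                    # residual around the shared attention
        w1, w2 = self.experts[modality] # route tokens to the modality's expert
        return h + np.maximum(h @ w1, 0) @ w2  # residual around the expert FFN
```

Because the attention weights are shared, the same block can serve as an image encoder, a text encoder, or a fusion encoder simply by switching which expert is activated.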

  2. Constrained Language Models Yield Few-Shot Semantic Parsers 

    November 1, 2021

    We explore the use of large pretrained language models as few-shot semantic parsers. The goal in semantic parsing is to generate a structured meaning representation given a natural language input. However, language models are trained to generate natural language. To bridge the gap, we use…

  3. The emergence of the shape bias results from communicative efficiency 

    November 1, 2021

    By the age of two, children tend to assume that new word categories are based on objects’ shape, rather than their color or texture; this assumption is called the shape bias. They are thought to learn this bias by observing that their caregiver’s language is…

  4. An Information-theoretic Approach to Distribution Shifts 

    November 1, 2021 | Marco Federici, Ryota Tomioka, and Patrick Forré

    Safely deploying machine learning models to the real world is often a challenging process. Models trained with data obtained from a specific geographic location tend to fail when queried with data obtained elsewhere, agents trained in a simulation can struggle to adapt when deployed in…

  5. How Powerful is Graph Convolution for Recommendation 

    October 26, 2021

    Graph convolutional networks (GCNs) have recently enabled a popular class of algorithms for collaborative filtering (CF). Nevertheless, the theoretical underpinnings of their empirical successes remain elusive. In this paper, we endeavor to obtain a better understanding of GCN-based CF methods via the lens of graph…
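The core mechanism this teaser refers to — propagating user and item embeddings over the normalized user-item interaction graph — can be sketched in a few lines. This is a generic LightGCN-style illustration, not this paper's method; the function name and all parameters are assumptions.

```python
import numpy as np

def gcn_cf_embeddings(R, dim=8, layers=2, seed=0):
    """Sketch of graph convolution for collaborative filtering: embed
    users and items as nodes of a bipartite graph and repeatedly average
    over symmetrically normalized neighborhoods.

    R: binary user-item interaction matrix of shape (n_users, n_items).
    """
    n_users, n_items = R.shape
    n = n_users + n_items
    # bipartite adjacency over the combined user+item node set
    A = np.zeros((n, n))
    A[:n_users, n_users:] = R
    A[n_users:, :n_users] = R.T
    # symmetric normalization D^{-1/2} A D^{-1/2}
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((n, dim))
    outs = [E]
    for _ in range(layers):
        E = A_hat @ E       # one graph-convolution (neighbor-smoothing) step
        outs.append(E)
    final = np.mean(outs, axis=0)  # combine layer outputs by averaging
    return final[:n_users], final[n_users:]
```

Scoring a user-item pair is then just a dot product between the returned embeddings; the repeated multiplication by `A_hat` is the low-pass smoothing whose theoretical role the paper examines.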

  6. ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition 

    October 11, 2021

    Object recognition has made great advances in the last decade, but predominantly still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning…

  7. Taming Sparsely Activated Transformer with Stochastic Experts 

    October 7, 2021

    Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can easily scale to an outrageously large number of parameters without a significant increase in computational cost. However, SAMs are reported to be parameter-inefficient, in that larger models do not always lead to better performance. While most…
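The "stochastic experts" idea in the title suggests replacing a learned gating network with random routing, so that every expert sees training data. A minimal sketch of such a layer is below; the function name, signature, and uniform-routing choice are assumptions made for illustration, not the paper's exact training procedure.

```python
import numpy as np

def stochastic_expert_layer(x, experts, rng=None):
    """Sketch of stochastic expert routing: instead of a learned gate,
    pick one feed-forward expert uniformly at random per call.

    experts: list of (w1, w2) weight pairs for ReLU feed-forward experts.
    """
    rng = rng if rng is not None else np.random.default_rng()
    w1, w2 = experts[rng.integers(len(experts))]  # random, not learned, routing
    return np.maximum(x @ w1, 0) @ w2
```

Because routing is random, different forward passes of the same input can activate different experts; a training scheme built on this would typically add a consistency term encouraging the experts' outputs to agree.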

  8. Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding 

    October 1, 2021

    This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The…

  9. Deep High-Resolution Representation Learning for Visual Recognition 

    September 30, 2021

    High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet,…