Microsoft Research Blog

  1. [Figure: How the image-text contrastive and translated-text contrastive objectives work together to align the embedding spaces of images, English text, and non-English text. Image-caption training data aligns the image and English domains via an image-text contrastive loss, leaving the non-English domain largely unaffected; parallel-corpus training data aligns the English and non-English domains via a translated-text contrastive loss, leaving the image domain largely unaffected. The resulting effect is that all three domains intersect in a shared space.]

    Turing Bletchley: A Universal Image Language Representation model by Microsoft

    Today, the Microsoft Turing team is thrilled to introduce Turing Bletchley, a 2.5-billion-parameter Universal Image Language Representation model (T-UILR) that can perform image-language tasks in 94 languages. T-Bletchley has an image encoder and a universal language encoder that vectorize the input image and text, respectively, so that semantically similar…
    November 1, 2021 by Saurabh Tiwary
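
    The dual contrastive objective illustrated in the figure above can be sketched with a symmetric InfoNCE-style loss. This is a minimal illustration, not the Turing team's implementation: the function name, temperature value, and embedding shapes are assumptions for the sketch.

    ```python
    import numpy as np

    def contrastive_loss(a, b, temperature=0.07):
        """Symmetric contrastive loss between two batches of embeddings
        (e.g. image vs. English text, or English vs. non-English text).
        Matching rows of a and b are treated as positive pairs."""
        # L2-normalize so dot products are cosine similarities.
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        logits = a @ b.T / temperature  # (batch, batch) similarity matrix

        def xent(l):
            # Cross-entropy with the diagonal (the true pair) as the target.
            l = l - l.max(axis=1, keepdims=True)  # numerical stability
            log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
            return -np.mean(np.diag(log_probs))

        # Average over both directions (a -> b and b -> a).
        return 0.5 * (xent(logits) + xent(logits.T))

    # Aligning all three domains combines two such losses:
    #   total = contrastive_loss(img_emb, en_emb) \
    #         + contrastive_loss(en_emb, non_en_emb)
    ```

    The first term pulls image and English-caption embeddings together; the second pulls English and translated-text embeddings together; transitively, all three spaces align.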
  2. [Figure: ORBIT benchmark dataset statistics: 77 blind and low-vision collectors, 486 objects, 3,822 videos, and 2,687,934 frames. Each user contributes two to ten objects to their bucket, with seven to eight videos per object.]

    Announcing the ORBIT dataset: Advancing real-world few-shot learning using teachable object recognition

    Object recognition systems have made spectacular advances in recent years, but they rely on training datasets with thousands of high-quality, labelled examples per object category. Learning new objects from only a few examples could open the door to many new applications. For example, robotics manufacturing…