{"id":791159,"date":"2021-11-01T11:07:56","date_gmt":"2021-11-01T18:07:56","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=791159"},"modified":"2022-02-01T11:47:52","modified_gmt":"2022-02-01T19:47:52","slug":"turing-bletchley-a-universal-image-language-representation-model-by-microsoft","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/turing-bletchley-a-universal-image-language-representation-model-by-microsoft\/","title":{"rendered":"Turing Bletchley: A Universal Image Language Representation model by Microsoft"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1024x576.jpg\" alt=\"An illustration of how the image text contrastive and translation text contrastive tasks work together to help align the space of images, English text and non-English text. On the left side of the illustration, the three domains\u2014Image Domain, English Domain, and Non-English Domain--are segregated. An arrow labeled \u201cImage-Captions training data\u201d points to another depiction of the three domains where the image domain and the English domain intersect but the non-English domain is still separate and shown in gray to show that it\u2019s not significantly affected. A two headed arrow with the label \u201cImage-Text contrastive loss\u201d is drawn between the image and English domains.  \n\nTowards the bottom of the image, an arrow labeled \u201cParallel corpus training data\u201d points to another depiction of the three domains where the English domain and the non-English domain intersect but the image domain is separate and shown in gray to indicate that it is not significantly affected. 
A two-headed arrow with the label \u201cTranslated Text Contrastive loss\u201d is drawn between the English and non-English domains. \n\nFinally, a third arrow with the label \u201cResulting Effect\u201d is drawn to the right of the image which points to a depiction of all three domains intersecting. \" class=\"wp-image-791447\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1536x864.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-2048x1152.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-scaled-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1280x720.jpg 1280w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Today, the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/turing.microsoft.com\/\">Microsoft Turing team<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is thrilled to introduce Turing&nbsp;Bletchley,&nbsp;a&nbsp;2.5-billion parameter Universal Image Language Representation model (T-UILR) that&nbsp;can perform image-language tasks in 94 languages. T-Bletchley has an image encoder and a universal language encoder that vectorize input images and text, respectively, so that semantically similar images and texts align with each other. This model demonstrates uniquely powerful capabilities and represents a groundbreaking advance in image-language understanding.<\/p>\n\n\n\n<p>T-Bletchley outperforms state-of-the-art models, like&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2102.05918\">Google\u2019s ALIGN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, on English image-language data sets (ImageNet, CIFAR, and COCO), and outperforms <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1909.03493\">MULE,<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2004.04312\">SMALR,<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\"
rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2006.02635\">M<sup>3<\/sup>P<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on universal image language data sets (Multi30k and COCO). To see T-Bletchley in action, navigate to the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/go.microsoft.com\/fwlink\/?linkid=2178904\">demo<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<h2 id=\"significance-of-multi-modal-and-universal\">Significance of multi-modal and universal<\/h2>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"682\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchley-Fig1-1024x682.jpg\" alt=\"A photo of a beautiful sunset on the beach. 
\" class=\"wp-image-791162\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchley-Fig1-1024x682.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchley-Fig1-300x200.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchley-Fig1-768x512.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchley-Fig1-240x160.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchley-Fig1.jpg 1076w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Image showing \u201ca beautiful sunset on the beach\u201d<\/figcaption><\/figure><\/div>\n\n\n\n<p>Language and vision are inherently linked. When we hear the statement \u201ca beautiful sunset on the beach\u201d we imagine an image similar to the one above. Models that focus only on language fail to capture this link. To these models, sentences are no more than a grammatically correct sequence of words.<\/p>\n\n\n\n<p>Furthermore, vision is a global modality. The same sight of the beach sunset can be narrated in any language (<em>\u201cuna hermosa puesta de sol en la playa\u201d<\/em>, <em>\u201cun beau coucher de soleil sur la plage\u201d<\/em>, <em>\u201cMatahari terbenam yang indah di pantai\u201d<\/em>, etc.), and it would not change the corresponding visual representation. Traditional multi-modal models tie vision to a particular language (most commonly English) and therefore fail to capture this universal property of vision.<\/p>\n\n\n\n<p>With T-Bletchley, we address both these shortcomings. We take a multi-modal approach that advances a computer\u2019s ability to understand language as well as understand images natively, just from pixels. Additionally, we consider language modality with a universal-first approach when developing the model. 
The result is a one-of-a-kind universal multi-modal model that understands images and text across 94 different languages, resulting in some impressive capabilities. For example, by utilizing a common image-language vector space, without using any metadata or extra information like surrounding text, T-Bletchley can retrieve images that match a text description provided in any language. It can also find images that answer text-based questions in any language, or images that are semantically like another image.<\/p>\n\n\n\n<h2 id=\"t-bletchley-in-action\">T-Bletchley in action<\/h2>\n\n\n\n<p>To test the capabilities of T-Bletchley, we built an image retrieval system consisting of 30 million randomly sampled images from the web that were unseen by the model during training.<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/go.microsoft.com\/fwlink\/?linkid=2178904\"> The images \u2013 without any captions, alt-text or other forms of text metadata \u2013 were encoded by the image encoder and stored in an index<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p>We built two types of retrieval systems \u2013 text-to-image and image-to-image. We vectorized the input queries (for text-to-image with the text encoder and for image-to-image with the image encoder). We use the encoded vector as key and query the index to find its nearest neighbors in the vector space using the approximate nearest neighbor (ANN) algorithm HNSW. 
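The encode-index-query pipeline can be sketched in a few lines. This is an illustrative stand-in only: it uses exact cosine-similarity search over unit-normalized vectors in place of an approximate HNSW index, and random vectors stand in for the encoder outputs.

```python
import numpy as np

def normalize(vectors):
    # unit-normalize rows so that a dot product equals cosine similarity
    v = np.asarray(vectors, dtype=np.float32)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def build_index(image_vectors):
    # a production system would insert these into an ANN structure such as
    # HNSW; for illustration we keep a dense matrix and search exactly
    return normalize(image_vectors)

def retrieve(index, query_vector, k=4):
    # nearest neighbors of the query in the shared vector space
    scores = index @ normalize(query_vector)
    top = np.argsort(-scores)[:k]
    return top.tolist(), scores[top]

# toy 1024-dimensional vectors standing in for encoder outputs
rng = np.random.default_rng(0)
index = build_index(rng.normal(size=(1000, 1024)))
ids, scores = retrieve(index, index[42], k=4)
```

Because the text and image encoders map into the same space, the same `retrieve` call serves both text-to-image and image-to-image lookups; only the encoder producing the query vector differs.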
The nearest neighbors are then displayed as the image retrieval results.<\/p>\n\n\n\n<p>Today\u2019s image retrieval systems depend heavily on the text metadata available for images, such as image captions, alt-text, surrounding text, and image URLs. T-Bletchley is unique in that the system performs image retrieval from just the encoded image vectors and does not use any text metadata. This is a big step toward true image understanding compared to today\u2019s systems. Moreover, the demo was built directly with the pre-trained model and was not fine-tuned on any image retrieval task.<\/p>\n\n\n\n<p>In addition, today\u2019s image retrieval systems apply object tagging algorithms to images to augment the text metadata (i.e., adding tags like car, house, or beach generated from the image). Since the object tagging systems are trained on human-labeled data, the number of classes (tags) is extremely limited. T-Bletchley is trained with unsupervised data, and as a result, it understands a very large number of real-world objects, actions, and other concepts (dancing, programming, racing, etc.).<\/p>\n\n\n\n<p>Below are some examples that showcase the capabilities of T-Bletchley in an image retrieval system.<\/p>\n\n\n\n<h2 id=\"universal-text-to-image-retrieval\">Universal text-to-image retrieval<\/h2>\n\n\n\n<p>Below are examples of images retrieved using text-based queries in multiple languages:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchlyFig3-1024x539.png\" alt=\"Three sets of four images. The first set shows different views of the Eiffel Tower at night retrieved using a Korean-language query. The second set shows different images of people playing soccer in the rain, retrieved using a query in Spanish. 
The third set shows different views of cats interacting with computers using a Finnish-language query that translates to English as \u201ccat programming\u201d. \" class=\"wp-image-791174\" width=\"900\" height=\"473\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchlyFig3-1024x539.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchlyFig3-300x158.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchlyFig3-768x404.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchlyFig3-240x126.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchlyFig3.png 1274w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/figure>\n\n\n\n<p>The third example shows that T-Bletchley &#8220;understands\u201d the act of programming and has carved out a vector subspace dedicated solely for images of cats programming. True image understanding can be used to improve current retrieval systems to place a greater weight on the image itself.<\/p>\n\n\n\n<h2 id=\"code-switched-retrieval\">Code-switched retrieval<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"339\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig3.1-1024x339.png\" alt=\"Two sets of four images. The first set shows small airplanes landing or at rest on unpaved runways, retrieved by a query mixing English and Khmer language. The second set shows groups of people posing for a photo at the Great Wall of China, retrieved using a query mixing English and Chinese.  
\" class=\"wp-image-791180\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig3.1-1024x339.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig3.1-300x99.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig3.1-768x255.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig3.1-240x80.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig3.1.png 1279w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>T-Bletchley can even retrieve images from non-English language queries written with English script!<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"346\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchleyfig4-1024x346.png\" alt=\"Two sets of four images. The first set shows groups of children playing cricket, retrieved using a query mixing English and English-script Hindi. The second set shows three photos and one artist\u2019s rendering of a train next to bodies of water, retrieved using a query mixing English and English-script Japanese. 
\" class=\"wp-image-791186\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchleyfig4-1024x346.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchleyfig4-300x101.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchleyfig4-768x260.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchleyfig4-240x81.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Bletchleyfig4.png 1290w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>T-Bletchley can understand sentences containing multiple languages and scripts:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"167\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig5-1024x167.png\" alt=\"A set of four images shows a dog playing with a ball, retrieved using a query mixing English and three other languages.\" class=\"wp-image-791192\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig5-1024x167.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig5-300x49.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig5-768x125.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig5-240x39.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig5.png 1250w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 id=\"image-to-image-retrieval\">Image-to-image retrieval<\/h2>\n\n\n\n<p>To evaluate image retrieval, we encode the given image using the image encoder and retrieve the closest image vectors and corresponding images from the index. 
Because T-Bletchley was trained to pick the best caption for an image, it tends to prefer semantically similar images instead of visually similar ones.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"224\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig6-1024x224.png\" alt=\"An image of a map of Seattle and an arrow pointing to four different images of a map of Seattle \" class=\"wp-image-791195\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig6-1024x224.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig6-300x66.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig6-768x168.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig6-240x53.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig6.png 1449w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The images retrieved by T-Bletchley are not necessarily similar in appearance to the query image. 
However, the images, all of the same geography, are \u2018semantically similar.\u2019 For instance, T-Bletchley does not return the following images from the retrieval set, even though they closely resemble the input image.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"212\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig7-1024x212.png\" alt=\"Four maps of different cities \" class=\"wp-image-791201\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig7-1024x212.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig7-300x62.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig7-768x159.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig7-1536x318.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig7-240x50.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig7.png 1573w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 id=\"understanding-text-within-images\">Understanding text within images<\/h2>\n\n\n\n<p>T-Bletchley is also able to understand text within images without the use of OCR technologies. In the following examples, images are directly passed to the image encoder and stored as 1024-dimensional vectors, and only the cosine similarity between these vectors is used to retrieve similar images.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"508\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BlechleyFig8-1024x508.png\" alt=\"Three sets of images. The first set shows a slide comparing microeconomics and macroeconomics. 
An arrow points to a series of four similar slides comparing microeconomics and macroeconomics. The second set shows a slide entitled: \u201cHOW DOES COVID-19 SPREAD?\u201d. An arrow points to a series of four similar slides explaining how COVID-19 is spread. The third set shows some French text with its English translation: diff\u00e9rence entre les donn\u00e9es primaires etsecondaires \u202f(difference between primary and secondary data in french). An arrow points to four separate images with English text referring to \u201cprimary data\u201d and \u201csecondary data\u201d. \" class=\"wp-image-791204\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BlechleyFig8-1024x508.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BlechleyFig8-300x149.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BlechleyFig8-768x381.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BlechleyFig8-240x119.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BlechleyFig8.png 1483w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>In the first example, T-Bletchley understands that the text in the image is about the differences between microeconomics and macroeconomics and retrieves similar slides. In the second example, T-Bletchley retrieves images related to COVID-19 even though T-Bletchley&#8217;s training data pre-dates COVID-19.<\/p>\n\n\n\n<p>This capability is universal\u2014it can be used in multiple languages. 
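As noted, retrieval in these examples reduces to cosine similarity between the 1024-dimensional image vectors. For reference, a minimal implementation of that similarity measure (not the model's actual code):

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors:
    # 1.0 for identical directions, 0.0 for orthogonal ones
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```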
The examples below show retrieval in French and Arabic.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"414\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig9-1024x414.png\" alt=\"An image of a text book cover titled: \u201cLa Revolution francaise\u201d, with an arrow pointing to a set of four similar images of textbooks depicting the French Revolution and other French history. \n\nAn image of a vintage advertisment for Coca-Cola with an arrow pointing to a set of four similar images, two in Arabic and two in English. \" class=\"wp-image-791207\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig9-1024x414.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig9-300x121.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig9-768x310.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig9-240x97.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig9.png 1466w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 id=\"t-bletchley-model-development\">T-Bletchley: model development<\/h2>\n\n\n\n<h3 id=\"dataset\">Dataset<\/h3>\n\n\n\n<p>T-Bletchley was trained using billions of image-caption pairs drawn from the web.<\/p>\n\n\n\n<p>Examples of the dataset are depicted below.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"909\" height=\"781\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig10.png\" alt=\"A set of three captions with a related image. The first query: \u201cPhoto of Joy Yee's Noodle Kitchen - Evanston, IL, United States. Huge selection of food\u201d, points to a restaurant menu. 
The second caption: \u201c2016 Ford Shelby GT350R \u2013 Photo\u202fGallery\u202f\u202f-\u202fRoadandTrack.com\u201d, points to a photo of a white sports car. The third query: \u201c\u0444\u043e\u0442\u043e \u043a\u0443\u0432\u0448\u0438\u043d\u043e\u0432 \u0441 \u0432\u043e\u0434\u043e\u0439\u201d, which translates to \u201cphoto of jugs of water\u201d in Russian, points to a photo of five clear drinking glasses being filled with water. \" class=\"wp-image-791216\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig10.png 909w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig10-300x258.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig10-768x660.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/BletchleyFig10-210x180.png 210w\" sizes=\"auto, (max-width: 909px) 100vw, 909px\" \/><\/figure>\n\n\n\n<p>A large, diverse training dataset resulted in a robust model that can handle a wide variety of images. To achieve universality, we trained the model on a parallel corpus of 500 million translation pairs. These pairs were created by extracting sentences from document-aligned webpages in the Common Crawl corpus. 
Adding the Translated Text Contrastive Task allowed us to create a language-agnostic vector representation of captions, which helped make the model much more universal.<\/p>\n\n\n\n<h2 id=\"model-architecture-training\">Model architecture & training<\/h2>\n\n\n\n<p>T-Bletchley consists of transformer-based image and text encoders, which are both analogous to the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1810.04805\">BERT-large<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> architecture.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1200x627_Bletchly_Slide5_graphics_edited_Light-2.jpg\" alt=\"A set of two images. The image on the left is an illustration of the image text contrastive task. Images containing a paper boat, flowers and a bird are shown to be encoded by the image encoder. Corresponding captions in different languages are shown to be encoded by the text encoder. Contrastive loss is then applied over the resulting vectors. The image on the right is an illustration of the translation text contrastive task. Sentences and their translations in different languages are encoded by the text encoder and the contrastive loss is applied over the resulting vectors. \"\/><\/figure>\n\n\n\n<p>Images and captions were independently encoded, and the model was then trained by applying a contrastive loss on the generated image and text vectors. 
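Both contrastive objectives follow the same pattern. Below is a minimal numpy sketch of a symmetric, InfoNCE-style contrastive loss over a batch of paired, unit-normalized vectors; it is an illustration of the general technique, not the exact training code.

```python
import numpy as np

def contrastive_loss(left_vecs, right_vecs, temperature=0.07):
    # rows are unit-normalized embeddings of matched pairs, e.g.
    # (image, caption) or (sentence, translation); off-diagonal
    # entries of the similarity matrix act as in-batch negatives
    logits = (left_vecs @ right_vecs.T) / temperature
    idx = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # symmetric: left-to-right and right-to-left retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With matched pairs on the diagonal of the similarity matrix, minimizing this loss pulls each pair together while pushing it away from the in-batch negatives.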
Similarly, to create a language-agnostic representation, each sentence from a translation pair was independently encoded and a contrastive loss was applied over the resulting batch of vectors.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"535\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1200x627_Bletchly_Slide1_graphics_edited_Light-1-1024x535.jpg\" alt=\"An illustration of how the image text contrastive and translation text contrastive tasks work together to help align the space of images, English text and non-English text. On the left side of the illustration, the three domains\u2014Image Domain, English Domain, and Non-English Domain--are segregated. An arrow labeled \u201cImage-Captions training data\u201d points to another depiction of the three domains where the image domain and the English domain intersect but the non-English domain is still separate and shown in gray to show that it\u2019s not significantly affected. A two headed arrow with the label \u201cImage-Text contrastive loss\u201d is drawn between the image and English domains.  \n\nTowards the bottom of the image, an arrow labeled \u201cParallel corpus training data\u201d points to another depiction of the three domains where the English domain and the non-English domain intersect but the image domain is separate and shown in gray to indicate that it is not significantly affected. A two-headed arrow with the label \u201cTranslated Text Contrastive loss\u201d is drawn between the English and non-English domains. \n\nFinally, a third arrow with the label \u201cResulting Effect\u201d is drawn to the right of the image which points to a depiction of all three domains intersecting. 
\" class=\"wp-image-791225\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1200x627_Bletchly_Slide1_graphics_edited_Light-1-1024x535.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1200x627_Bletchly_Slide1_graphics_edited_Light-1-300x157.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1200x627_Bletchly_Slide1_graphics_edited_Light-1-768x401.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1200x627_Bletchly_Slide1_graphics_edited_Light-1-1536x803.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1200x627_Bletchly_Slide1_graphics_edited_Light-1-2048x1070.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1200x627_Bletchly_Slide1_graphics_edited_Light-1-240x125.jpg 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>In this way, despite the image caption pairs being predominantly in English, we managed to align captions in different languages with corresponding images.<\/p>\n\n\n\n<p>We leveraged the kernels in the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/DeepSpeed\">DeepSpeed <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>library (compatible with PyTorch) for our transformer\u2019s implementation and the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/zero-memory-optimization-towards-training-a-trillion-parameter-models\/\">ZeRO optimizer<\/a> for training the model.<\/p>\n\n\n\n<h2 id=\"in-depth-model-evaluation\">In-depth model evaluation<\/h2>\n\n\n\n<p>T-Bletchley advances the state of the art across multiple public benchmarks.<\/p>\n\n\n\n<h3 id=\"english\">English<\/h3>\n\n\n\n<p>For this evaluation,\u00a0we followed\u00a0the prompt 
engineering and ensembling approach described in Google\u2019s <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2102.05918\">ALIGN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> paper. T-Bletchley outperforms Google\u2019s ALIGN model on English image-language benchmarks and sets a new state of the art in zero-shot image classification, an area pioneered by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openai.com\/blog\/clip\/\">OpenAI\u2019s CLIP model<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\"><strong>Model<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>ImageNet<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>CIFAR-100<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>CIFAR-10<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>COCO R@1 <br>image -> text <\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>COCO R@1 <br>text -> image<\/strong><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2102.05918\">ALIGN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><td class=\"has-text-align-center\" data-align=\"center\">76.4<\/td><td class=\"has-text-align-center\" data-align=\"center\">&#8211; <\/td><td class=\"has-text-align-center\" data-align=\"center\">&#8211;<\/td><td class=\"has-text-align-center\" 
data-align=\"center\">58.6<\/td><td class=\"has-text-align-center\" data-align=\"center\">45.6<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">T-Bletchley<\/td><td class=\"has-text-align-center\" data-align=\"center\">79.0<\/td><td class=\"has-text-align-center\" data-align=\"center\">83.5<\/td><td class=\"has-text-align-center\" data-align=\"center\">97.7<\/td><td class=\"has-text-align-center\" data-align=\"center\">59.1<\/td><td class=\"has-text-align-center\" data-align=\"center\">43.3<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>When fine-tuned for retrieval, T-Bletchley outperforms ALIGN, the previous state of the art, by more than two points on the COCO test set. <\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\"><strong>Model <\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Flickr 30k Recall @1<\/strong><br> image -> text    &nbsp &nbsp &nbsp &nbsp    text->image <\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>COCO Recall @1<\/strong><br>image ->text  &nbsp &nbsp &nbsp &nbsp   text->image<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2004.06165\">OSCAR<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/td><td class=\"has-text-align-center\" data-align=\"center\">&#8211;                             &#8211;<\/td><td class=\"has-text-align-center\" data-align=\"center\">73.5   &nbsp &nbsp &nbsp &nbsp 57.5<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2102.05918\">ALIGN<span class=\"sr-only\"> (opens in new 
tab)<\/span><\/a><\/td><td class=\"has-text-align-center\" data-align=\"center\">95.3     &nbsp &nbsp &nbsp &nbsp  84.9<\/td><td class=\"has-text-align-center\" data-align=\"center\">77.0   &nbsp &nbsp &nbsp &nbsp  59.5<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">T-Bletchley<\/td><td class=\"has-text-align-center\" data-align=\"center\">97.1    &nbsp &nbsp &nbsp &nbsp  87.4<\/td><td class=\"has-text-align-center\" data-align=\"center\">80.2   &nbsp &nbsp &nbsp &nbsp  62.3<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>T-Bletchley achieves state-of-the-art results on English-specific tasks even when compared to English-only models. Its English performance is not hindered by universal language support!<\/p>\n\n\n\n<h3 id=\"universal\">Universal<\/h3>\n\n\n\n<p>T-Bletchley&#8217;s universal retrieval capabilities were evaluated on the Multi30k, COCO-CN and COCO-JP datasets and compared to multilingual models. Even before fine-tuning, T-Bletchley significantly outperforms previous models.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\"><strong>Setting<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Model<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Multi30k<br>French        German        Czech<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>COCO<br>Chinese        Japanese <\/strong><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">Zero Shot<\/td><td class=\"has-text-align-center\" data-align=\"center\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2006.02635\">M<sup>3<\/sup>P<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><br>T-Bletchley<\/td><td class=\"has-text-align-center\" data-align=\"center\">27.1  
&nbsp &nbsp &nbsp  36.8   &nbsp &nbsp &nbsp 20.4<br>85.0    &nbsp &nbsp &nbsp 83.2   &nbsp &nbsp &nbsp  81.2<\/td><td class=\"has-text-align-center\" data-align=\"center\">32.3   &nbsp &nbsp &nbsp 33.3<br>81.5  &nbsp &nbsp &nbsp   64.8<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>When T-Bletchley is fine-tuned, the model sets new state-of-the-art results in multiple languages, shown in the table below.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\"><strong>Setting<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Model<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>Multi30k<br>French        German        Czech<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>COCO<br>Chinese        Japanese <\/strong><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">Finetuned<\/td><td class=\"has-text-align-center\" data-align=\"center\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1909.03493\">MULE<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><br><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2004.04312\">SMALR<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><br><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2006.02635\">M<sup>3<\/sup>P<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><br>T-Bletchley<\/td><td class=\"has-text-align-center\" data-align=\"center\">62.3   &nbsp &nbsp &nbsp  64.1   &nbsp &nbsp &nbsp   57.7<br>65.9   &nbsp &nbsp &nbsp  69.8  &nbsp &nbsp &nbsp 64.8<br>73.9 &nbsp 
&nbsp &nbsp   82.7   &nbsp &nbsp &nbsp 72.2<br>94.6   &nbsp &nbsp &nbsp 94.3  &nbsp &nbsp &nbsp  93.6<\/td><td class=\"has-text-align-center\" data-align=\"center\">75.6   &nbsp &nbsp &nbsp 75.9<br>76.7  &nbsp &nbsp &nbsp  77.5<br>86.2    &nbsp &nbsp &nbsp  87.9<br>89.0   &nbsp &nbsp &nbsp 86.3<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 id=\"future-applications\">Future applications<\/h3>\n\n\n\n<p>The goal of T-Bletchley is to create a model that understands text and images as seamlessly as humans do. The first version of T-Bletchley represents a significant breakthrough in this mission. We expect the T-Bletchley model to improve image question answering, image search, and image-to-image search experiences in Bing, Microsoft Office and Azure.<\/p>\n\n\n\n<p><strong>Note on Responsible AI:<\/strong>\u00a0Like other publicly available models, the Microsoft Turing models are trained with billions of pages of publicly available text\u00a0and images, and hence may have picked up biases around gender, race and more from these public documents. Mitigating negative effects from these biases is a difficult, industry-wide issue and Microsoft is committed to the advancement and use of AI grounded in principles that put people first and benefit society. We are putting these\u00a0<a href=\"https:\/\/www.microsoft.com\/en-us\/ai\/responsible-ai?activetab=pivot1%3aprimaryr6\">Microsoft AI principles<\/a>\u00a0into practice throughout the company and have taken extensive precautionary measures to prevent these implicit biases from being exhibited when using the models in our products. 
We strongly encourage developers to do the same by putting appropriate guardrails and mitigations in place before taking these models to production.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today, the Microsoft Turing team (opens in new tab) is thrilled to introduce Turing&nbsp;Bletchley,&nbsp;a&nbsp;2.5-billion parameter Universal Image Language Representation model (T-UILR) that&nbsp;can perform image-language tasks in 94 languages. T-Bletchley has an image encoder and a universal language encoder that vectorize input image and text respectively so that semantically similar images and texts align with each [&hellip;]<\/p>\n","protected":false},"author":40735,"featured_media":791447,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Saurabh Tiwary","user_id":"39603"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-791159","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[691494,649749],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-scaled-960x540.jpg\" class=\"img-object-cover\" alt=\"An illustration of how the image text contrastive and translation text contrastive tasks work together to help align the space of images, English text and non-English text. On the left side of the illustration, the three domains\u2014Image Domain, English Domain, and Non-English Domain--are segregated. An arrow labeled \u201cImage-Captions training data\u201d points to another depiction of the three domains where the image domain and the English domain intersect but the non-English domain is still separate and shown in gray to show that it\u2019s not significantly affected. A two headed arrow with the label \u201cImage-Text contrastive loss\u201d is drawn between the image and English domains. Towards the bottom of the image, an arrow labeled \u201cParallel corpus training data\u201d points to another depiction of the three domains where the English domain and the non-English domain intersect but the image domain is separate and shown in gray to indicate that it is not significantly affected. A two-headed arrow with the label \u201cTranslated Text Contrastive loss\u201d is drawn between the English and non-English domains. 
Finally, a third arrow with the label \u201cResulting Effect\u201d is drawn to the right of the image which points to a depiction of all three domains intersecting.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-scaled-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1536x864.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-2048x1152.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/1400x788_Bletchly_no_logo_dot_graphic-1-1920x1080.jpg 1920w\" 
sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Saurabh Tiwary","formattedDate":"November 1, 2021","formattedExcerpt":"Today, the Microsoft Turing team (opens in new tab) is thrilled to introduce Turing&nbsp;Bletchley,&nbsp;a&nbsp;2.5-billion parameter Universal Image Language Representation model (T-UILR) that&nbsp;can perform image-language tasks in 94 languages. T-Bletchley has an image encoder and a universal language encoder that vectorize input image and text respectively&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/791159","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/40735"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=791159"}],"version-history":[{"count":15,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/791159\/revisions"}],"predecessor-version":[{"id":817462,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/791159\/revisions\/817462"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/791447"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=791159"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=791159"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=791159"},{"taxonomy":"msr-research-area","embeddable":true,"href"
:"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=791159"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=791159"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=791159"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=791159"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=791159"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=791159"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=791159"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=791159"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
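
---

Editor's appendix. The training objective the post describes (each side of an image-caption or translation pair is encoded independently, then a contrastive loss is applied over the resulting batch of vectors) can be sketched as follows. This is a minimal NumPy illustration of a symmetric, InfoNCE-style batch contrastive loss, not T-Bletchley's actual implementation: the function name, temperature value, and toy data are assumptions for the example, and the real model is trained with PyTorch, DeepSpeed kernels, and the ZeRO optimizer.

```python
import numpy as np

def batch_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    `a` and `b` are (batch, dim) embedding arrays where row i of `a` pairs
    with row i of `b` (image/caption, or English/translated sentence).
    Matching pairs are pulled together; all other in-batch combinations
    act as negatives and are pushed apart.
    """
    # L2-normalize so the dot product is cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature  # (batch, batch); diagonal = true pairs

    def cross_entropy_diag(l):
        # Softmax cross-entropy with the diagonal entries as the targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the two retrieval directions (a -> b and b -> a)
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

# Correctly paired embeddings should yield a much lower loss than shuffled ones
rng = np.random.default_rng(0)
imgs = rng.normal(size=(8, 16))
caps = imgs + 0.01 * rng.normal(size=(8, 16))  # near-identical toy pairs
print(batch_contrastive_loss(imgs, caps), batch_contrastive_loss(imgs, caps[::-1]))
```

In T-Bletchley's setup, the same loss form is applied both to (image, caption) batches and to (English, translation) batches; because the English text participates in both objectives, minimizing them jointly pulls images, English text, and non-English text into one shared vector space, which is the "resulting effect" shown in the figure.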