Azure AI milestone: New foundation model Florence v1.0 advances state of the art, topping popular computer vision leaderboards

The Project Florence Team

Animated GIF shows results on several leaderboards: With the new computer vision foundation model Florence v1.0, the Project Florence team set the new state of the art on the popular leaderboards TextCaps Challenge 2021, nocaps, Kinetics-400/Kinetics-600 action classification, and OK-VQA Leaderboard. 

Florence v1.0—along with recent milestones in Neural Text-to-Speech and question answering—is part of a larger Azure AI mission to provide relevant, meaningful AI solutions and services that work better for people because they better capture how people learn and work—with improved vision, knowledge understanding, and speech capabilities. At the center of these efforts is XYZ-code, a joint representation of three cognitive attributes: monolingual text (X), audio or visual sensory signals (Y), and multilingual (Z). For more information about these efforts, read the XYZ-code blog post. 

Project Florence was launched by Microsoft Azure Cognitive Services in May 2020 to advance its large-scale multitask, multimodal computer vision services. Today, we’re thrilled to announce an important milestone: Florence v1.0, a computer vision foundation model that successfully scales to a large variety of vision and vision-language tasks.

Florence v1.0 demonstrates superior performance on challenging tasks such as zero-shot image classification, image/text retrieval, open-set object detection, and visual question answering. We’ve achieved new state-of-the-art results by large margins on a wide range of benchmarks. Supported by Florence v1.0, we’ve also achieved the new state of the art on multiple popular vision and vision-language leaderboards, including TextCaps Challenge 2021 and Kinetics-400/Kinetics-600 action classification. Florence v1.0 is currently being deployed in Azure Cognitive Services, helping to enhance its computer vision offerings.
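To make zero-shot classification and image/text retrieval concrete, here is a minimal sketch of how these tasks are commonly performed with a model that embeds images and text in one shared space. It is an illustration under stated assumptions, not the Florence implementation: the function names are hypothetical, and the two-dimensional vectors stand in for the outputs of real image and language encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text embedding is closest to the image embedding.

    class_text_embs maps a class name to the embedding of its text prompt.
    No classifier is trained; new classes only require new text prompts.
    """
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))


def retrieve(query_emb, candidates, top_k=2):
    """Rank (id, embedding) candidates by similarity to the query embedding."""
    ranked = sorted(candidates, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:top_k]]


# Hypothetical toy embeddings standing in for real encoder outputs.
class_prompts = {"dog": [1.0, 0.0], "boat": [0.0, 1.0]}
photo = [0.9, 0.2]
print(zero_shot_classify(photo, class_prompts))
```

The same similarity score serves both tasks: classification compares one image with a handful of class prompts, while retrieval ranks a whole collection against a single query.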

A holistic, people-centered approach to AI

Project Florence is part of ongoing efforts to develop AI that operates more like people do, a journey that has been challenging but exciting. We take a holistic and people-centered approach to learning and understanding by using multimodality. Our approach examines the relationship between three attributes of human cognition: monolingual text (X), audio or visual sensory cues (Y), and multilingual (Z). It brings them together under XYZ-code, a common representation to enable AI that can speak, hear, see, and understand better. The goal is to create pretrained basic AI models that learn common representations of different modalities and support a wide range of downstream AI tasks. With the ability to leverage additional external domain knowledge, these models can underpin AI systems that interpret and interact in the world more like people do.

In helping to advance the ambitious goal of XYZ-code, the Project Florence team achieved its first milestone last year, attaining state-of-the-art performance on the nocaps benchmark. Compared with image descriptions provided by people, captions for the same images generated by the AI system were more detailed and precise. This capability is a key component of the Microsoft mission of inclusive and accessible technology.

From left to right, figure shows the workflow of Florence v1.0, beginning with the curation of an image-text dataset from the internet, represented by two database icons. An arrow points from the database icons to two image-text pairs. The text—“rowers carrying boat over heads on a dock” and “dog”—and their corresponding images are input into a language encoder, represented by a blue square labeled “language encoder,” and an image encoder, represented by a blue square labeled “image encoder (CoSwin),” respectively, and Florence is pretrained via unified contrastive learning, represented by a gray square labeled as such. The pretrained model is then adapted to four tasks, each one represented by a green square labeled with the task: classification/retrieval adaptation; object-level representation using a Dynamic Head Adaptor; fine-grained vision-language representation using a METER Adaptor; and video representation using a Video CoSwin Adaptor. From each green box an arrow points to an image representing the respective task. An arrow from this group of images points to a yellow square labeled “Unified Vision Stack” and another arrow points from “Unified Vision Stack” to a set of new images labeled collectively “deployment.” Learn more about Florence v1.0 in the research paper.
Florence v1.0 leverages data curation, unified learning, a Transformer architecture comprising an image encoder and a language encoder, and adaptation. It can be integrated into modern computer vision systems to power real-world vision and multimedia applications. Existing image-text pretraining models are mainly limited to cross-modal shared representations for classification and retrieval (illustrated by the light-green adaptation module above). Florence expands the representation to support object detection, modalities beyond RGB (such as image depth), and video.
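The pretraining step in the figure pairs an image encoder with a language encoder and trains them with unified contrastive learning. The sketch below shows one common way such an objective can be written, assuming that image-text pairs sharing the same label (for example, the same caption) count as positives for each other. The function names, the label convention, and the temperature value of 0.07 are illustrative assumptions, not details confirmed by this post; the research paper is the authoritative description.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs, temperature):
    """Temperature-scaled cosine similarity of every image with every text."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
            for img in images]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def one_direction_loss(logits, labels):
    """Mean negative log-likelihood of the positives in each row."""
    total = 0.0
    for i, row in enumerate(logits):
        log_probs = log_softmax(row)
        # Every column with the same label as row i is a positive (assumption).
        positives = [j for j in range(len(row)) if labels[j] == labels[i]]
        total -= sum(log_probs[j] for j in positives) / len(positives)
    return total / len(logits)


def unified_contrastive_loss(image_embs, text_embs, labels, temperature=0.07):
    """Sum of the image-to-text and text-to-image contrastive losses."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return one_direction_loss(logits, labels) + one_direction_loss(transposed, labels)


# Hypothetical batch: pair i holds the embeddings of image i and its text.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
print(unified_contrastive_loss(images, texts, labels=[0, 1]))
```

When every pair has a distinct label, this reduces to the familiar two-way contrastive loss over a batch. Sharing a label across pairs lets repeated captions or class names reinforce one another instead of being pushed apart as negatives.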

Florence v1.0: From research to application

Project Florence’s mission is to take the advancements being made in areas such as feature representation learning, transfer learning, and model architecture search and turn them into applications that can empower our partners and customers to achieve more with Azure Cognitive Services. Florence v1.0 and other AI breakthroughs achieved so far are being transferred to the cloud platform, helping to improve model quality for image captioning, tagging, and customized object detection. 

The Florence image captioning model is available to customers via the computer vision offering of Azure Cognitive Services, which is part of Azure AI, and can enable developers to incorporate alt text more easily, helping them improve accessibility of their own products and services. The Florence image captioning model is also being incorporated into Seeing AI, an app that identifies text, objects, and people in a user’s surroundings, and Microsoft Word, Outlook, and PowerPoint on various platforms.

The Florence image tagging model is also available through the Azure Cognitive Services computer vision offering. It’s being incorporated into OneDrive to empower the photo search and recommendation experience for millions of users.

The Florence models can be further adapted with additional customer data through model fine-tuning. This moves us closer to our ambition of “custom vision for all”—that is, providing developers and customers with tools to build and improve models customized to meet their unique needs—where new vision objects can be recognized by the Florence model with few-shot fine-tuning.

The achievements here have helped pave the way toward supplying AI models themselves as a service in production, and they contribute to many ongoing projects, from Intelligent Photo for Microsoft 365 to planogram compliance for Microsoft industry clouds to Spatial Analysis for Microsoft Dynamics 365.

We’ll have more updates in the coming months. Please check out our project page to learn more about our technology and latest advancements. 

Note on Responsible AI 

Microsoft is committed to the advancement and use of AI grounded in principles that put people first and benefit society. We are putting these Microsoft AI principles into practice throughout the company and strongly encourage developers to do the same. For guidance on deploying AI responsibly, visit Responsible use of AI with Cognitive Services.

Acknowledgment 

This research was conducted by the Project Florence team under Azure Cognitive Services, in close collaboration with the Microsoft Research Deep Learning Group. Thanks to the Office of the Chief Technology Officer, Integrated Training Platform, ONNX Runtime, and DeepSpeed teams for making this great accomplishment possible. Thanks to Luis Vargas for coordination and Microsoft Research Asia for its help and collaboration. Thanks also to Jianfeng Gao, Baining Guo, Michael Zeng, Yumao Lu, Zicheng Liu, Ce Liu, and Xuedong Huang for their leadership and support.