Project Florence (AI)

Advancing state-of-the-art computer vision technologies

Project Florence is a Microsoft AI Cognitive Services initiative to advance state-of-the-art computer vision technologies and develop the next-generation framework for visual recognition.

Of the five senses, vision is the one humans rely on most. It is estimated that 80%-85% of our perception, learning, cognition, and activities are mediated through vision, and it took nature 540 million years of evolution to develop our visual intelligence system. It is thus no surprise that developing a computer system with the same visual intelligence is so challenging, yet so sought after by many real applications.

Since 2012, the breakthrough in deep learning has led to remarkable progress in visual recognition. For example, in image classification, the top-1 accuracy of 1000-class classification on ImageNet improved dramatically from 50.9% (before 2012) to 90.2% (in 2021). However, looking at real applications that need computer vision techniques, we still see a large gap between the current state of the art and the performance required in real scenarios. Face detection has been widely used in many applications with nearly perfect precision and recall; for example, average precision exceeds 95% on the easy and medium sets of the very challenging WIDER FACE benchmark. In contrast, generic object detection can only reach an mAP (mean average precision) of 65.9% for 500 categories on Open Images, which greatly hinders its use in mission-critical applications. The challenge grows even larger when we move to the video space, our ultimate dream scenario, in which we also need to track different objects, manage their identities, and understand their interactions.

Motivated by the strong demand from real applications and by recent research progress on feature representation learning, transfer learning, cross-modality understanding, and model architecture search, we strive to advance the state of the art and develop universal backbones with shared representations for a wide spectrum of visual categories. Our aim is to accelerate the shipping of Microsoft vision products using state-of-the-art large-scale deep learning models.