
Azure Florence

Of the five senses, vision is the one humans rely on most. It is estimated that 80–85% of our perception, learning, cognition, and activities are mediated through vision. It took nature 540 million years of evolution to develop our visual intelligence system. It is thus no surprise that building a computer system with the same visual intelligence is so challenging, yet so sought after by many real applications. One such application is Seeing AI, an app developed by Microsoft, which can change the lives of the 285 million people around the world who are blind or have low vision by empowering their phones to tell them what they see, giving them new insights into their surroundings.

Since 2012, breakthroughs in deep learning have led to remarkable progress in visual recognition. In image classification, for example, the top-1 accuracy of 1000-class classification on ImageNet has improved dramatically, from 50.9% (before 2012) to 88.4% (in 2020). However, when we look at real applications that need computer vision techniques, we still see a large gap between the current state of the art and the performance those applications require.

Face detection has been widely used in many applications with nearly perfect precision and recall (average precision above 95% on the easy and medium sets of the very challenging WIDER FACE benchmark). Generic object detection, in contrast, reaches a mean average precision of only 65.9% for 500 categories on Open Images, which greatly hinders its use in mission-critical applications. The challenge grows even bigger in the video space, our ultimate dream scenario, where we also need to track different objects, manage their identities, and understand their interactions.

Azure Florence has been funded by the Microsoft AI Cognitive Service team since March 2020. Motivated by the strong demand from real applications and by recent research progress on feature representation learning, transfer learning, cross-modality understanding, and model architecture search, we strive to advance the state of the art and develop the best computer vision technologies as part of our mission to empower everyone on the planet to achieve more.