Announcing the ORBIT dataset: Advancing real-world few-shot learning using teachable object recognition

Published

Object recognition systems have made spectacular advances in recent years, but they rely on training datasets with thousands of high-quality, labelled examples per object category. Learning new objects from only a few examples could open the door to many new applications. For example, robotics manufacturing requires a system to quickly learn new parts, while assistive technologies need to be adapted to the unique needs and abilities of every individual.

Few-shot learning aims to reduce these demands by training models that can recognize completely novel objects from only a few examples, say 1 to 10. In particular, meta-learning algorithms—which ‘learn to learn’ using episodic training—are a promising approach to significantly reduce the number of training examples needed to train a model. However, most research in few-shot learning has been driven by benchmark datasets that lack the high variation that applications face when deployed in the real world. 

In partnership with City, University of London, we introduce the ORBIT dataset and few-shot benchmark for learning new objects from only a few, high-variation examples to close this gap. The dataset and benchmark set a new standard for evaluating machine learning models in few-shot, high-variation learning scenarios, which will help to train models for higher performance in real-world scenarios. This work is done in collaboration with a multi-disciplinary team, including Simone Stumpf, Lida Theodorou, and Matthew Tobias Harris from City, University of London and Luisa Zintgraf from University of Oxford. The work was funded by Microsoft AI for Accessibility. You can read more about the ORBIT research project and its goal to make AI more inclusive of people with disabilities in this AI Blog post.

You can learn more about the work in our research papers: “ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition,” published at the International Conference of Computer Vision (ICCV 2021), and “Disability-first Dataset Creation: Lessons from Constructing a Dataset for Teachable Object Recognition with Blind and Low Vision Data Collectors,” published at the 23rd International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2021).

You’re also invited to join Senior Researcher Daniela Massiceti for a talk about the ORBIT benchmark dataset and harnessing few-shot learning for teachable AI at the first Microsoft Research Summit. Massiceti will be presenting “Bucket of me: Using few-shot learning to realize teachable AI systems” as part of the Responsible AI track on October 19. To view the presentation on demand, register at the Research Summit event page.

The ORBIT benchmark dataset contains 3,822 videos of 486 objects recorded by 77 people who are blind or low vision using their mobile phones—a total of 2,687,934 frames. Code for loading the dataset, computing benchmark metrics, and running baselines is available at the ORBIT dataset GitHub page.

On left, text reads ORBIT benchmark dataset: 77 blind and low-vision collectors, 486 objects, 3822 videos, and 2687934 frames. On right, a graphic of a face mask with a line that connects to a picture of a cloth mask with a black and white zig-zag pattern. The line reads seven to eight videos per object. Below the face mask graphic, there are three yellow objects resembling a watering can, a key, and a comb. A line next to these reads two to ten objects per user. The objects are falling into a green bucket, with a line to the right of the bucket that reads user’s bucket.
Figure 1: The ORBIT dataset and few-shot benchmark is being introduced to drive innovation in learning new objects from only a few, high-variation examples, setting a new standard for evaluating machine learning models for real-world deployment.

Inspired by teachable object recognizers

The ORBIT dataset and benchmark are inspired by a real-world application for the blind and low-vision community: teachable object recognizers. These allow a person to teach a system to recognize objects that may be important for them by capturing just a few short videos of those objects. These videos are then used to train an object recognizer that is personalized. This would allow a person who is blind to teach the object recognizer their house keys or favorite shirt, and then recognize them with a phone. Such objects cannot be identified by typical object recognizers as they are not included in common object recognition training datasets.

Teachable object recognition is an excellent example of a few-shot, high-variation scenario. It’s few-shot because people can only capture a handful of short videos recorded to “teach” a new object. Most current machine learning models for object recognition require thousands of images to train. It’s not feasible to have people submit videos at that scale, which is why few-shot learning is so important when people are teaching object recognizers from their own videos. It’s high-variation because each person has only a few objects, and the videos they capture of these objects will vary in quality, blur, centrality of object, and other factors as shown in Figure 2.

Two rows of images that were submitted by users. Top: an off-center image of a light blue surgical mask and a hand touching the left ear loop, an upside-down blue and bright pink pet brush in the upper left of the frame, an image of a set of gold keys that are partially cut off in the frame, a teal watering can shot at a sharp angle with a hand in the foreground. Bottom: a partial image of a set of wall hooks full of clothes and other miscellaneous items including the surgical mask, a black countertop with the blue and bright pink pet brush in the center of the frame with partial images of a cereal bowl, a bag of bananas, and a beige bag; a blurry image of the gold keys on a bed with towels, clothing and a book all cropped; an overhead view of the teal watering can and partial images of plants on a brick patio.
Figure 2: Images from the ORBIT dataset, illustrating the high variation embodied in user-submitted videos (for example, blur, objects not in the center of the image, and objects appearing sideways or upside down)

Human-centric benchmark for teachable object recognition

While datasets are fundamental for driving innovation in machine learning, good metrics are just as important in helping researchers evaluate their work in realistic settings. Grounded in this challenging, real-world scenario, we propose a benchmark on the ORBIT dataset. Unlike typical computer vision benchmarks, performance on the teachable object recognition benchmark is measured based on input from each user.

This means that the trained machine learning model is given just the objects and associated videos for a single user, and it is evaluated by how well it can recognize that user’s objects. This process is done for each user in a set of test users. The result is a suite of metrics that more closely captures how well a teachable object recognizer would work for a single user in the real world.

Three line graphs show accuracy of few-shot learning models on existing benchmarks – first, Omniglot (Lake et al. 2015, Vinyals et al. 2017), second Mini-imagenet (Vinyals et al. 2017), and third Meta-Dataset (Triantafillou et al. 2019). The trend shows how few-shot classification accuracy on all 3 benchmarks has rapidly increased over the last 5 years and is nearing saturation today: on Omniglot, accuracy is now above 99%, on Mini-Image Net above 90%, and on Meta-Dataset above 75%.
Figure 3: Performance on highly cited few-shot learning models is saturated on existing benchmarks.

Evaluations on highly cited few-shot learning models show that there is significant scope for innovation in high-variation, few-shot learning. Despite saturation of model performance on existing few-shot benchmarks, few-shot models only achieve 50-55% accuracy on the teachable object recognition benchmark. Moreover, there is a high variance between users. These results illustrate the need to make algorithms more robust to high-variation (or “noisy”) data.

A table that shows model performance on ORBIT dataset benchmark. ProtoNets (Snell et al., 2017) achieves a frame accuracy of 50.3 percent with a 1.4 percent confidence interval, and requires 3.53 times ten to the twelfth power Multiply-Accumulate operations or MACs to personalize. CNAPs (Requiema et al., 2019) achieves a frame accuracy of 51.5 percent with a 1.8 percent confidence interval, and requires 3.87 times ten to the twelfth power MACs to personalize. MAML (Finn et al., 2017) achieves a frame accuracy of 51.7 percent with a 1.9 percent confidence interval, and requires 82.39 times ten to the twelfth power MACs to personalize. FineTuner (Chen et al., 2020) achieves a frame accuracy of 53.7 percent with a 1.8 percent confidence interval, and requires 274.62 times ten to the twelfth power MACs to personalize.
Figure 4: There remains significant scope for innovation with highly-cited, few-shot models only reaching 55% accuracy on the ORBIT dataset benchmark: [1] Snell et al. (2017); [2] Requiema et al. (2019); [3] Finn et al. (2017); [4] Chen et al. (2020).

Research to realize human-AI collaboration

Creating teachable object recognizers presents challenges for machine learning beyond object recognition. One example of a challenge posed by a human-centric task formulation is the need for the model to provide feedback to users about the data they provided when training in a new personal object. Is it enough data? Is it good-quality data? Uncertainty quantification is an area of machine learning that can contribute to solving this challenge.

Moreover, the challenges in building teachable object recognition systems go beyond machine learning algorithmic improvements, making it an area ripe for multi-disciplinary teams. Designing the feedback of the model to help users become better teachers requires a great deal of subtlety in user interaction. Supporting the adaptation of models to run on resource-constrained devices such as mobile phones is also a significant engineering task.

In summary, the ORBIT dataset and benchmark provide a rich playground to drive research in approaches that are more robust to few-shot, high-variation conditions, a step beyond existing curated vision datasets and benchmarks. In addition to the ORBIT benchmark, the dataset can be used to explore a wide set of other real-world recognition tasks. We hope that these contributions will not only have real-world impact by shaping the next generation of recognition tools for the blind and low-vision community, but also improve the robustness of computer vision systems across a broad range of other applications.