Using transfer learning to address label noise for large-scale image classification


In this post, we show how to use transfer learning to address label noise in large-scale image classification tasks. We’ll avoid describing the approach using too much math. If you are interested in the deeper theory behind this approach, please refer to our paper, “CleanNet: Transfer learning for scalable image classifier training with label noise,” presented at CVPR 2018 in Salt Lake City, Utah.

One of the key factors driving recent advances in image classification is large collections of labeled images such as ImageNet and MS-COCO. However, building a large set of human-labeled images is extremely expensive and time-consuming. In practice, web-labeled data is much cheaper. One common way to get web-labeled images is scraping from image search. We can define a set of queries for a category and use them to retrieve images belonging to the category. For example, we use “tuna” and “maguro” (the Japanese word for tuna) as text queries to retrieve images for the category “tuna.” Of course, image search is not perfect and there will be label noise, as you can see from the results shown in Figure 1.

Figure 1 – Bing image search for “maguro” (the Japanese word for tuna). Mislabeled images are marked with a red x.

Other methods, such as using user-provided tags (for example, Flickr lets users tag images), encounter the same label noise issue. As you can imagine, label noise can significantly reduce the accuracy of the induced classifiers because it confuses the classifier with false information. It is therefore important to develop algorithms for learning in the presence of label noise. In practice, we hope to have methods that use the least human supervision possible because manual labeling is expensive.

Previously proposed methods can be categorized in one of two ways: those relying on human supervision and those free of any human supervision.

Human supervision methods

One of the most common ways this works in practice is to have human labelers verify the class labels for a subset of images, then learn to discriminate mislabeled images from that verification (verification is a kind of manual labeling). Let’s say we have a category called “waffle” and collect images of waffles from a noisy source. We can sample images belonging to the category “waffle,” manually identify the correctly labeled and incorrectly labeled samples, and train a model such as a k-NN classifier to propagate the verification to other images of waffles, as demonstrated in Figure 2.

Figure 2 – A model learns from manually identified correctly and incorrectly labeled images in the category “waffle” and uses what it learns to predict the correctness of class labels of other “waffle” images.
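The per-class verification idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the 2-D feature vectors are made-up toy values standing in for real image features, and `knn_verify` is a hypothetical helper, not code from the paper.

```python
import math

def knn_verify(verified, query, k=3):
    """Predict whether `query` is correctly labeled by majority vote
    among its k nearest manually verified neighbors.

    verified: list of (feature_vector, is_correct) pairs for ONE class.
    """
    nearest = sorted(verified, key=lambda item: math.dist(item[0], query))[:k]
    votes_correct = sum(1 for _, is_correct in nearest if is_correct)
    return votes_correct * 2 > len(nearest)

# Toy "waffle" features (assumed values): True = verified as a real waffle.
verified_waffles = [
    ((0.9, 0.8), True),
    ((1.0, 0.9), True),
    ((0.8, 1.0), True),
    ((0.1, 0.2), False),
    ((0.2, 0.1), False),
    ((0.0, 0.3), False),
]

# Unverified images that carry the noisy label "waffle".
unverified = {"img_a": (0.85, 0.9), "img_b": (0.15, 0.15)}
predictions = {name: knn_verify(verified_waffles, feat)
               for name, feat in unverified.items()}
print(predictions)
```

Note that `verified_waffles` only helps with waffle images; every other category would need its own verified set, which is the scalability problem discussed next.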

Some of the large-scale datasets like LSUN and Places adopted similar ideas for dataset construction to reduce the cost. There is one problem – you have to do this for all categories! This represents a huge disadvantage in scalability. Let’s say you want to train an image classifier in 10,000 categories. Verifying 50 images per class is 500,000 manual labels. For classification tasks with millions of classes (this is called extreme classification), it is not remotely feasible to have even one image verified per class.

Sans human supervision methods

A popular approach here is unsupervised outlier removal. This assumes that outliers are mislabeled samples that should be removed and detects outliers with methods like One-Class SVM [1]. However, outliers often are not well defined and therefore removing them presents a challenge. Sometimes outliers are just hard examples that have great value in learning. Being free of human supervision, these methods are certainly scalable but often less effective and more heuristic.
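To make the idea concrete, here is a minimal sketch of unsupervised outlier removal. It deliberately swaps in a much simpler technique than One-Class SVM: it drops samples that lie far from the class centroid. The function name, the cutoff rule, and the toy features are all assumptions for illustration.

```python
import math
import statistics

def keep_inliers(features, z_cutoff=1.5):
    """Return indices of samples whose distance to the class centroid is
    at most mean + z_cutoff * std of all distances (a simple stand-in
    for One-Class SVM, not the same algorithm)."""
    dim = len(features[0])
    centroid = [statistics.fmean(f[i] for f in features) for i in range(dim)]
    dists = [math.dist(f, centroid) for f in features]
    cutoff = statistics.fmean(dists) + z_cutoff * statistics.pstdev(dists)
    return [i for i, d in enumerate(dists) if d <= cutoff]

# Five similar images and one that looks very different (toy values).
class_features = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0), (5.0, 5.0),
]
kept = keep_inliers(class_features)
print(kept)
```

Nothing in this rule can tell whether the removed sample was mislabeled or simply a hard but valid example, which is the weakness described above.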

With either of these existing approaches, either all classes need to be manually verified or none do. In other words, it’s difficult to have both scalability and effectiveness.

So we set out to close this gap. We wanted the benefit of human supervision without having to provide it for every category. We asked: what about making this a domain adaptation task and using transfer learning to solve it? More specifically, we wanted to verify images for some classes and transfer the knowledge of “What could be label noise?” to other classes. The task would look like that depicted in Figure 3.

Figure 3 – Making label noise detection a domain adaptation task to achieve scalability and effectiveness.

In this way, the verification labeling effort doesn’t grow with the number of classes because we don’t need to verify images for every class. Thanks to recent advances in transfer learning, we can now use deep neural networks to solve this problem.

Our Approach

We observed that one of the key ideas in learning to discriminate label noise is identifying class prototypes to effectively represent classes. Look at Figure 4 and think about the paradigm mentioned earlier in Figure 2 that uses human supervision to address label noise.

Figure 4 – Identifying correctly-labeled samples is about finding representative class prototypes that can describe the visual appearance of a category.

Think of identifying correctly-labeled images in a noisy set of images collected for a category as finding representative “class prototypes” that can describe the visual appearance of the category. Inspired by this observation, we developed an attention mechanism that learns how to select class prototypes from a noisy set of images collected for a category (we call such an image set a “reference set” of a category). Given any reference set, we want our model to use attention to find class prototypes, even if the model wasn’t trained with human supervision for the class that the reference set represents.

With the idea of using attention to select class prototypes, we designed a joint-embedding network as illustrated in Figure 5.

Figure 5 – Joint embedding network architecture

On one side, the reference set encoder uses attention to select class prototypes from a class reference set and encodes the information into a “class embedding” vector. On the other side, the query encoder encodes a query image into a query embedding.
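A rough sketch of how attention can turn a reference set into a single class embedding is shown below. It is a simplified illustration, not the network from the paper: the dot-product scoring, the hand-picked `context` vector (which a real model would learn), and the toy features are all assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(reference_features, context):
    """Weight each reference feature by its attention score and
    return (class_embedding, attention_weights)."""
    scores = [sum(x * c for x, c in zip(f, context)) for f in reference_features]
    weights = softmax(scores)
    dim = len(reference_features[0])
    embedding = [sum(w * f[j] for w, f in zip(weights, reference_features))
                 for j in range(dim)]
    return embedding, weights

# Two features that agree with each other and one that looks like noise.
reference_set = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
context = (4.0, -4.0)  # hypothetical stand-in for a learned attention vector
class_embedding, weights = attention_pool(reference_set, context)
print(weights, class_embedding)
```

With these toy values the noisy third feature receives almost no weight, so the class embedding is dominated by the two features acting as prototypes.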

The distance between a class embedding and a query embedding is intended to indicate whether the query image belongs to the class of the reference set. Let’s say we have an image whose class label is “waffle” but we want to verify this. We can feed this image to the query encoder to produce the query embedding and compare it with the class embedding produced by the reference set encoder. High similarity implies the class label is correct; low similarity implies it is mislabeled.
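The comparison step might look like the following sketch, which assumes cosine similarity as the similarity measure. The embeddings are made-up toy vectors, and the `threshold` of 0.5 is an arbitrary illustrative value that would need tuning in practice.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def label_looks_correct(query_embedding, class_embedding, threshold=0.5):
    """Accept the noisy label when the query is similar enough to the class."""
    return cosine_similarity(query_embedding, class_embedding) >= threshold

waffle_class = (0.9, 0.1, 0.2)    # class embedding from the reference set encoder
query_waffle = (0.8, 0.2, 0.1)    # query embedding close to the class
query_other = (0.1, 0.9, 0.3)     # query embedding far from the class

print(label_looks_correct(query_waffle, waffle_class))
print(label_looks_correct(query_other, waffle_class))
```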

To achieve the intended effect, we apply a matching constraint during training: we push up the similarity between manually identified correctly-labeled query images and the corresponding class embedding vectors, and push down the similarity for query images identified as mislabeled. Because the attention component is part of the reference set encoder, its learning is driven by the supervision from query images.
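One plausible form of such a matching constraint is a similarity-based loss like the sketch below. The exact loss is given in the paper; the function name, the `negative_weight` parameter, and the choice to stop penalizing mislabeled queries once their similarity is at or below zero are assumptions made for illustration.

```python
def matching_loss(similarity, is_correct, negative_weight=1.0):
    """Loss for one verified query image, given its similarity in [-1, 1]
    to the class embedding.

    Correctly labeled: loss shrinks as similarity approaches 1.
    Mislabeled: loss shrinks as similarity drops to 0 or below.
    """
    if is_correct:
        return 1.0 - similarity
    return negative_weight * max(0.0, similarity)

# A correct query that is already similar incurs little loss...
print(matching_loss(0.9, is_correct=True))
# ...while a mislabeled query with the same similarity is penalized.
print(matching_loss(0.9, is_correct=False))
```

Minimizing this quantity over verified query images pulls correct queries toward their class embedding and pushes mislabeled ones away from it.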

The reference set encoder doesn’t have to learn from human supervision for all classes – it learns to adapt to different reference sets that represent different classes. In this way, we don’t need to provide human supervision for all categories in training, but we are still able to address label noise for all categories.

Where do we apply this model?

Our method can be applied to label noise detection and learning classification from noisy data.

Label noise detection

This model is an effective tool for label noise detection. Simple thresholding based on the similarity between the reference set and the query image can lead to good results compared with existing methods. Note that label noise detection is not only useful for training image classifiers with noisy data, but is also valuable in applications like image search result filtering and linking images to knowledge graph entities.

Learning classification from noisy data

This model predicts the relevance of an image to its noisy class label. Therefore, we’re proposing to use it to assign weights to image samples according to the image-to-label relevance to guide training of the image classifier.
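The weighting idea can be sketched as a weighted cross-entropy loss. This is a minimal illustration under assumptions: clipping negative similarities to zero and normalizing by the total weight are choices made here, and the probabilities and similarities are toy values.

```python
import math

def sample_weight(similarity):
    """Map image-to-label similarity to a non-negative training weight."""
    return max(0.0, similarity)

def weighted_cross_entropy(probabilities, labels, similarities):
    """Average cross-entropy where each sample counts in proportion to
    how relevant the image is to its noisy label."""
    weights = [sample_weight(s) for s in similarities]
    losses = [-math.log(p[y]) for p, y in zip(probabilities, labels)]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total_weight

# Classifier outputs for two images that both carry the noisy label 0.
probabilities = [[0.8, 0.2], [0.1, 0.9]]
labels = [0, 0]
similarities = [0.9, -0.2]  # the second image looks unrelated to its label

loss = weighted_cross_entropy(probabilities, labels, similarities)
print(loss)
```

Here the second image gets zero weight, so its likely wrong label does not pull the classifier toward a false target.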

In our experiments for label noise detection and classification learning, our method outperforms those using no human supervision by a large margin when a small fraction of classes is manually verified. It also matches existing methods that require extensive human supervision when sufficient classes are manually verified.

Looking ahead

We believe this work creates a novel paradigm that efficiently utilizes human supervision to better address label noise in large-scale image classification tasks. And we’re very excited about the benefits that this work can bring to our image recognition products. In the near future, we also expect this paradigm to impact other fields, such as text and video.

We encourage you to learn more about it by reading our paper, “CleanNet: Transfer learning for scalable image classifier training with label noise”.

[1] Schölkopf, Bernhard, et al. “Estimating the support of a high-dimensional distribution.” Neural computation 13.7 (2001): 1443-1471.