{"id":490835,"date":"2018-06-15T08:33:32","date_gmt":"2018-06-15T15:33:32","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=490835"},"modified":"2018-06-21T10:47:20","modified_gmt":"2018-06-21T17:47:20","slug":"using-transfer-learning-to-address-label-noise-for-large-scale-image-classification","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/using-transfer-learning-to-address-label-noise-for-large-scale-image-classification\/","title":{"rendered":"Using transfer learning to address label noise for large-scale image classification"},"content":{"rendered":"<p>In this post, we introduce how to use transfer learning to address label noise for large-scale image classification tasks. We\u2019ll avoid describing the approach using too much math. If you are interested in the deeper theory behind this approach, please refer to our paper, \u201c<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1711.07131\">CleanNet: Transfer learning for scalable image classifier training with label noise<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,\u201d presented at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/cvpr2018.thecvf.com\/program\/main_conference\">CVPR 18<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> in Salt Lake City, Utah.<\/p>\n<p>One of the key factors driving recent advances in image classification is large collections of labeled images such as ImageNet and MS-COCO. However, building a large set of human-labeled images is extremely expensive and time-consuming. In practice, web-labeled data is much cheaper. One common way to get web-labeled images is scraping from image search. 
We can define a set of queries for a category and use them to retrieve images belonging to the category. For example, we can use \u201ctuna\u201d and \u201cmaguro\u201d (the Japanese word for tuna) as text queries to retrieve images for the category \u201ctuna\u201d. Of course, image search is not perfect and there will be label noise, as you can see from the results shown in Figure 1.<\/p>\n<div id=\"attachment_491690\" style=\"width: 1817px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-491690\" class=\"wp-image-491690 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/Figure1_newimage_bingimagesformaguro.jpg\" alt=\"\" width=\"1807\" height=\"1089\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/Figure1_newimage_bingimagesformaguro.jpg 1807w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/Figure1_newimage_bingimagesformaguro-300x181.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/Figure1_newimage_bingimagesformaguro-768x463.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/Figure1_newimage_bingimagesformaguro-1024x617.jpg 1024w\" sizes=\"auto, (max-width: 1807px) 100vw, 1807px\" \/><p id=\"caption-attachment-491690\" class=\"wp-caption-text\">Figure 1 \u2013 Bing image search for \u201cmaguro\u201d (the Japanese word for tuna). Mislabeled images are marked with a red x.<\/p><\/div>\n<p>Other methods, such as using user-provided tags (for example, Flickr lets users tag images), encounter the same label noise issue. As you can imagine, label noise can significantly affect the accuracy of the induced classifiers because it confuses the classifier with false information. It is therefore important to develop algorithms for learning in the presence of label noise. 
In practice, we hope to have methods that use the least human supervision possible because manual labeling is expensive.<\/p>\n<p>Previously proposed methods can be categorized in one of two ways: those relying on human supervision and those free of any human supervision.<\/p>\n<p><strong>Human supervision methods<\/strong><\/p>\n<p>One of the most common ways this works in practice is having human labelers verify the class labels for a subset of images and learning to discriminate mislabeled images from such verification (verification is a kind of manual labeling). Let\u2019s say we have a category called \u201cwaffle\u201d and collect images of waffles from a noisy source. We can sample images belonging to the category \u201cwaffle\u201d, manually identify correctly-labeled and incorrectly-labeled image samples, and train a model such as a K-NN classifier to propagate the verification to other images of waffles, as demonstrated in Figure 2.<\/p>\n<div id=\"attachment_490844\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-490844\" class=\"wp-image-490844 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig2_manually-identified-1024x488.png\" alt=\"\" width=\"1024\" height=\"488\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig2_manually-identified-1024x488.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig2_manually-identified-300x143.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig2_manually-identified-768x366.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-490844\" class=\"wp-caption-text\">Figure 2 \u2013 A model learns from manually identified correctly-labeled and incorrectly-labeled images in the category \u201cwaffle\u201d and uses them to predict the correctness of class labels 
of other \u201cwaffle\u201d images.<\/p><\/div>\n<p>Some of the large-scale datasets like <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/lsun.cs.princeton.edu\/2017\/\">LSUN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/places.csail.mit.edu\/\">Places<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> adopted similar ideas for dataset construction to reduce the cost. There is one problem \u2013 you have to do this for all categories! This represents a huge disadvantage in scalability. Let\u2019s say you want to train an image classifier on 10,000 categories. Verifying 50 images per class means 500,000 manual labels. For classification tasks with millions of classes (this is called extreme classification), it is not remotely feasible to have even one image verified per class.<\/p>\n<p><strong>Sans human supervision methods<\/strong><\/p>\n<p>A popular approach here is unsupervised outlier removal. This assumes that outliers are mislabeled samples that should be removed and detects them with methods like One-Class SVM [1]. However, outliers are often not well defined, and therefore removing them presents a challenge. Sometimes outliers are just hard examples that have great value in learning. Being free of human supervision, these methods are certainly scalable but often less effective and more heuristic.<\/p>\n<p>Going with either of these existing approaches, either all the classes or none needs to be manually verified. In other words, it\u2019s difficult to have both scalability and effectiveness.<\/p>\n<p>So we strove to close this gap. We wanted the benefit of human supervision but not to have to do this for every category. 
So we asked: what about making this a domain adaptation task and using transfer learning to solve it? More specifically, we wanted to verify images for some classes and transfer the knowledge of \u201cWhat could be label noise?\u201d to other classes. The task would look like that depicted in Figure 3.<\/p>\n<div id=\"attachment_490847\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-490847\" class=\"wp-image-490847 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig3_makinglabelnoisedetectionadomain-1024x487.png\" alt=\"\" width=\"1024\" height=\"487\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig3_makinglabelnoisedetectionadomain-1024x487.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig3_makinglabelnoisedetectionadomain-300x143.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig3_makinglabelnoisedetectionadomain-768x365.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig3_makinglabelnoisedetectionadomain.png 1431w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-490847\" class=\"wp-caption-text\">Figure 3 \u2013 Making label noise detection a domain adaptation task to achieve scalability and effectiveness.<\/p><\/div>\n<p>In this way, the verification labeling effort doesn\u2019t grow with the number of classes because we don\u2019t need to verify images for every class. Thanks to recent advances in transfer learning, we can now use deep neural networks to solve this problem.<\/p>\n<h3><strong>Our Approach<\/strong><\/h3>\n<p>We observed that one of the key ideas in learning to discriminate label noise is identifying class prototypes that effectively represent classes. 
Look at Figure 4 and think about the paradigm mentioned earlier in Figure 2 that uses human supervision to address label noise.<\/p>\n<div id=\"attachment_490859\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-490859\" class=\"wp-image-490859 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/identifyingcorrectlylabeled-1024x488.png\" alt=\"\" width=\"1024\" height=\"488\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/identifyingcorrectlylabeled-1024x488.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/identifyingcorrectlylabeled-300x143.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/identifyingcorrectlylabeled-768x366.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-490859\" class=\"wp-caption-text\">Figure 4 \u2013 Identifying correctly-labeled samples is about finding representative class prototypes that can describe the visual appearance of a category.<\/p><\/div>\n<p>Think of identifying correctly-labeled images in a noisy set of images collected for a category as finding representative \u201cclass prototypes\u201d that can describe the visual appearance of the category. Inspired by this observation, we developed an attention mechanism that learns how to select class prototypes from a noisy set of images collected for a category (we call such an image set a \u201creference set\u201d of the category). 
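To build intuition, here is a minimal pure-Python sketch of attention-based prototype selection over a reference set. All names, shapes, and the hand-set scoring vector are illustrative assumptions, not the actual CleanNet architecture, where the scoring is learned end-to-end:

```python
import math

def attention_pool(reference_feats, w):
    # Score each reference image by a dot product with a scoring vector w.
    # (In CleanNet this scoring is learned; here w is hand-set for illustration.)
    scores = [sum(f_j * w_j for f_j, w_j in zip(f, w)) for f in reference_feats]
    # Softmax turns scores into attention weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    # The class embedding is the attention-weighted average of the features,
    # so it is dominated by the images selected as class prototypes.
    dim = len(reference_feats[0])
    return [sum(a * f[j] for a, f in zip(attn, reference_feats)) for j in range(dim)]

# Toy reference set for one category: two on-topic images and one mislabeled one.
reference = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
w = [1.0, -1.0]  # hand-set so that the on-topic images score higher
class_embedding = attention_pool(reference, w)
```

The mislabeled third image receives a small attention weight, so the resulting class embedding stays close to the two prototype-like images.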
Given any reference set, we want our model to use attention to find class prototypes, even when the model was never specifically trained with human supervision for the class that the reference set represents.<\/p>\n<p>With the idea of using attention to select class prototypes, we designed a joint-embedding network as illustrated in Figure 5.<\/p>\n<div id=\"attachment_490853\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-490853\" class=\"wp-image-490853 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig5-1024x488.png\" alt=\"\" width=\"1024\" height=\"488\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig5-1024x488.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig5-300x143.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/fig5-768x366.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-490853\" class=\"wp-caption-text\">Figure 5 \u2013 Joint embedding network architecture<\/p><\/div>\n<p>On one side, the reference set encoder uses attention to select class prototypes from a class reference set and encodes the information into a \u201cclass embedding\u201d vector. On the other side, the query encoder encodes a query image into a query embedding.<\/p>\n<p>The distance between a class embedding and a query embedding is intended to indicate whether the query image belongs to the class of the reference set. Let\u2019s say we have an image whose class label is \u201cwaffle\u201d but we want to verify this. 
We can feed this image to the query encoder to produce the query embedding and compare it with the class embedding produced by the reference set encoder. High similarity implies the class label is correct; low similarity implies it is mislabeled.<\/p>\n<p>To achieve the intended effect, we apply a matching constraint during training that pushes up the similarity between manually identified correctly-labeled query images and the corresponding class embedding vectors, and pushes it down otherwise. As part of the reference set encoder, the attention component\u2019s learning is driven by the supervision from query images.<\/p>\n<p>The reference set encoder doesn\u2019t have to learn from human supervision for all classes \u2013 it learns to adapt to different reference sets that represent different classes. In this way, we don\u2019t need to provide human supervision for all categories in training, but we are still able to address label noise for all categories.<\/p>\n<h3><strong>Where do we apply this model?<\/strong><\/h3>\n<p>Our method can be applied to label noise detection and to learning classification from noisy data.<\/p>\n<p><strong>Label noise detection<\/strong><\/p>\n<p>This model is an effective tool for label noise detection. Simple thresholding based on the similarity between the reference set and the query image can lead to good results compared with existing methods. Note that label noise detection is not only useful for training image classifiers with noisy data, but also has important value in applications like image search result filtering and linking images to knowledge graph entities.<\/p>\n<p><strong>Learning classification from noisy data<\/strong><\/p>\n<p>This model predicts the relevance of an image to its noisy class label. 
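To make the idea concrete, here is a minimal sketch of how such a relevance score can drive both noise detection and sample weighting. The cosine similarity, the 0.5 threshold, and all function names are illustrative assumptions, not the actual CleanNet implementation:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relevance(query_emb, class_emb):
    # Map cosine similarity from [-1, 1] to a [0, 1] relevance score.
    return (cosine(query_emb, class_emb) + 1.0) / 2.0

def is_mislabeled(query_emb, class_emb, threshold=0.5):
    # Label noise detection: simple thresholding on the relevance score.
    return relevance(query_emb, class_emb) < threshold

def sample_weight(query_emb, class_emb):
    # Classifier training: use the relevance score as the weight of this
    # sample's loss term, down-weighting likely-mislabeled images.
    return relevance(query_emb, class_emb)

class_emb = [1.0, 0.0]    # toy embedding of the "waffle" reference set
good_query = [0.9, 0.1]   # image that really shows a waffle
bad_query = [-0.2, 1.0]   # image scraped by mistake
```

Under this sketch, the correctly-labeled query keeps a high weight while the mislabeled one is flagged and down-weighted; the threshold would in practice be tuned on verified classes.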
Therefore, we\u2019re proposing to use it to assign weights to image samples according to the image-to-label relevance to guide training of the image classifier.<\/p>\n<p>In our experiments for label noise detection and classification learning, our method outperforms those using no human supervision by a large margin when a small fraction of classes is manually verified. It also matches existing methods that require extensive human supervision when sufficient classes are manually verified.<\/p>\n<h3><strong>Looking ahead<\/strong><\/h3>\n<p>We believe this work creates a novel paradigm that efficiently utilizes human supervision to better address label noise in large-scale image classification tasks. And we\u2019re very excited about the benefits that this work can bring to our image recognition products. In the near future, we also expect this paradigm to impact other fields, such as text and video.<\/p>\n<p>We encourage you to learn more about it by reading our paper, \u201c<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1711.07131\">CleanNet: Transfer learning for scalable image classifier training with label noise<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u201d.<\/p>\n<p><strong>References<\/strong><br \/>\n[1] Sch\u00f6lkopf, Bernhard, et al. &#8220;Estimating the support of a high-dimensional distribution.&#8221; Neural computation 13.7 (2001): 1443-1471.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post, we introduce how to use transfer learning to address label noise for large-scale image classification tasks. We\u2019ll avoid describing the approach using too much math. 
If you are interested in the deeper theory behind this approach, please refer to our paper, \u201cCleanNet: Transfer learning for scalable image classifier training with label noise,\u201d [&hellip;]<\/p>\n","protected":false},"author":37074,"featured_media":491237,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Lei Zhang","user_id":"32641"},{"type":"user_nicename","value":"Kuang-Huei Lee","user_id":"37568"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[194471,194474],"tags":[],"research-area":[13562],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-490835","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-vision","category-data-visulalization","msr-research-area-computer-vision","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[488849],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"480\" height=\"280\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/CVPRCleanNet_CV_Carousel_06_2018_480x280.jpg\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/CVPRCleanNet_CV_Carousel_06_2018_480x280.jpg 480w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/CVPRCleanNet_CV_Carousel_06_2018_480x280-300x175.jpg 300w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/>","byline":"Lei Zhang and Kuang-Huei Lee","formattedDate":"June 15, 2018","formattedExcerpt":"In this post, we introduce how to use transfer learning to address label noise for large-scale image classification tasks. We\u2019ll avoid describing the approach using too much math. If you are interested in the deeper theory behind this approach, please refer to our paper, \u201cCleanNet:&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/490835","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37074"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=490835"}],"version-history":[{"count":18,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/490835\/revisions"}],"predecessor-version":[{"id":492110,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/490835\/revisions\/492110"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/491237"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=490835"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=490835"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2
\/tags?post=490835"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=490835"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=490835"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=490835"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=490835"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=490835"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=490835"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=490835"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=490835"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}