{"id":728458,"date":"2021-02-25T10:07:20","date_gmt":"2021-02-25T18:07:20","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=728458"},"modified":"2021-02-25T10:10:23","modified_gmt":"2021-02-25T18:10:23","slug":"hexa-self-supervised-pretraining-with-hard-examples-improves-visual-representations","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/hexa-self-supervised-pretraining-with-hard-examples-improves-visual-representations\/","title":{"rendered":"HEXA: Self-supervised pretraining with hard examples improves visual representations"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_no_logo_animation-1.gif\" alt=\"\"\/><\/figure>\n\n\n\n<p>Humans perceive the world by observing a large number of visual scenes around them and then effectively generalizing\u2014in other words, interpreting and identifying scenes they haven\u2019t encountered before\u2014without heavily relying on labeled annotations for every single scene. 
One of the core aspirations in artificial intelligence is to develop algorithms and techniques that endow computers with a strong generalization ability, learning to make sense of the visual world from raw pixel data alone, which aligns more closely with how humans process visual information.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--left\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Blog<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deep-infomax-learning-good-representations-through-mutual-information-maximization\/\" data-bi-cN=\"Deep InfoMax: Learning good representations through mutual information maximization\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Deep InfoMax: Learning good representations through mutual information maximization<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>Currently, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/deep-infomax-learning-good-representations-through-mutual-information-maximization\/\">self-supervised pretraining<\/a> (SSP) has emerged as a promising research field, showing great success in approaching this problem. 
The goal of SSP is to learn general-purpose intermediate representations, with the expectation that the representations carry rich semantic or structural meanings and can be beneficial to a variety of practical downstream tasks.<\/p>\n\n\n\n<p>SSP methods rely heavily on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/hazyresearch.stanford.edu\/data-aug-part-1\">data augmentation<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> to create different views of an image using <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/pytorch.org\/docs\/stable\/torchvision\/transforms.html\">image transformations<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, such as random crop (with flip and resize), color distortion, and Gaussian blur. However, image transformations are agnostic to the pretraining objectives, and it remains unknown how to augment views specifically based on the pretraining tasks themselves or how they affect the generalization of the learned models.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--right\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/self-supervised-pre-training-with-hard-examples-improves-visual-representations\/\" data-bi-cN=\"Self-supervised Pre-training with Hard Examples Improves Visual Representations.\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Self-supervised Pre-training with 
Hard Examples Improves Visual Representations.<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>In this blog post, we first review SSP as a problem of predicting pseudo-labels\u2014labels derived from the given data itself rather than with the help of human annotations. Building on this view, we describe how to generate hard examples (HEXA), a family of augmented views whose pseudo-labels are difficult to predict. This leads to a novel SSP framework, from which we develop two algorithms that achieve state-of-the-art performance on well-established vision tasks. Please check out our paper, \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/self-supervised-pre-training-with-hard-examples-improves-visual-representations\/\">Self-supervised Pre-training with Hard Examples Improves Visual Representations<\/a>,&#8221; which elaborates on the details presented in this blog post.<\/p>\n\n\n\n<h2 id=\"self-supervision-a-pseudo-label-view\">Self-supervision: A pseudo-label view<\/h2>\n\n\n\n<p>The <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/lilianweng.github.io\/lil-log\/2019\/11\/10\/self-supervised-learning.html\">self-supervised task<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (also known as pretext task) exploits a variety of weak signals that exist intrinsically in images as pseudo-labels, maximizing the agreement between pseudo-labels and the learned representations. (These weak signals often come with the data for free.) 
For example, we might rotate images at random and train a model to predict how each input image is rotated.<\/p>\n\n\n\n<p>The rotation prediction task itself is artificial, so its accuracy is not what ultimately matters; as with other auxiliary tasks, we care mostly about recognition accuracy on downstream tasks. We expect the model to learn high-quality representations for real-world tasks, such as constructing an object recognition classifier with very few labeled samples. Most state-of-the-art SSP methods focus on designing novel pretext objectives, broadly categorized into two types:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Type 1:<\/strong> <strong>contrastive learning.<\/strong> As a recently popular concept, contrastive learning is a framework that learns representations by maximizing agreement between differently augmented views of the same image via a contrastive loss in the latent space. In the instance discrimination pretext task, a binary pseudo-label is constructed: a positive pair is formed if both views are data-augmented versions of the same image; otherwise, a negative pair is formed. 
<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/facebookresearch\/moco\">MoCo <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>(or <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/facebookresearch\/moco\">MoCo-v2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>) and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/google-research\/simclr\">SimCLR <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>are two well-known studies in this line, differing in how negative samples are maintained. OpenAI\u2019s <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openai.com\/blog\/clip\/\">CLIP <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>is also based on contrastive learning, but it\u2019s built for image-text pairs collected from the web.<\/li><li><strong>Type 2: prototype learning.<\/strong> As probably one of the oldest unsupervised learning methods, prototype learning introduces a \u201cprototype\u201d as the centroid for a cluster formed by similar image views. The latent representations are fed into a clustering algorithm to produce the cluster assignments, which are subsequently used as \u201cpseudo-labels\u201d to supervise model updates for representation learning. 
<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/facebookresearch\/deepcluster\">DeepCluster <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>(or <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/facebookresearch\/swav\">DeepCluster-v2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>) is a representative prototype learning work, alternating between two steps: feature clustering using <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/K-means_clustering\">K-means<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and feature learning by predicting these pseudo-labels.<\/li><\/ul>\n\n\n\n<p>Some recent works, such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2006.09882\">Swapping Assignments between Views<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (SwAV) and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2005.04966\">Prototypical Contrastive Learning <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>(PCL), combine these two types.<\/p>\n\n\n\n<h2 id=\"hard-examples-as-pretraining-data\">Hard examples as pretraining data<\/h2>\n\n\n\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/hazyresearch.stanford.edu\/data-aug-part-1\">Data augmentation (DA)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> plays a vital role in SSP. Most existing methods synthesize views through random image transformations, without explicitly considering how difficult it is for the generated pseudo-labels to be predicted correctly by the model being pretrained. By contrast, we focus on studying hard examples, which are defined as augmented views whose pseudo-labels are difficult to predict. 
In Figure 1, we visually illustrate how hard examples are constructed from image transformations, detailing the derivation process below. Specifically, we consider two DA schemes for generating hard examples: adversarial examples and cut-mixed examples.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"624\" height=\"229\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/Figure1_Hexa_Blog.png\" alt=\"graphical user interface, application\" class=\"wp-image-728461\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/Figure1_Hexa_Blog.png 624w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/Figure1_Hexa_Blog-300x110.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/Figure1_Hexa_Blog-16x6.png 16w\" sizes=\"auto, (max-width: 624px) 100vw, 624px\" \/><figcaption>Figure 1: Illustration of how hard examples are constructed from image transformations. For the original dog image in (a), existing SSP methods employ random transformations to generate augmented views in (b). We propose two types of hard examples: (c) adversarial examples that add perturbations on views and (d) cut-mixed examples that cut and paste patches between views.<\/figcaption><\/figure><\/div>\n\n\n\n<p><strong>Adversarial examples<\/strong>. These are produced by adding perturbations to transformation-augmented views so as to fool the trained model\u2019s pseudo-label predictions as much as possible. By definition, adversarial examples tend to be \u201charder\u201d than the original view for model prediction. 
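To make the construction concrete, here is a minimal, hypothetical sketch of an FGSM-style perturbation that increases a generic pseudo-label loss. The linear "SSP head," its random weights, and the four pseudo-classes (e.g., four rotation angles) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def pseudo_label_loss(W, x, y):
    # cross-entropy of a toy linear "SSP head" predicting pseudo-label y
    return -np.log(softmax(W @ x)[y])

def fgsm_hard_example(W, x, y, eps=0.1):
    # gradient of the cross-entropy w.r.t. the input: W^T (p - onehot(y))
    p = softmax(W @ x)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_x = W.T @ (p - onehot)
    # step in the sign of the gradient, i.e., the direction that increases the loss
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))  # 4 hypothetical pseudo-classes (e.g., rotation angles)
x = rng.normal(size=8)       # features of one augmented view
y = 2                        # its pseudo-label
x_adv = fgsm_hard_example(W, x, y)

# the adversarial view is at least as "hard" as the original one
assert pseudo_label_loss(W, x_adv, y) >= pseudo_label_loss(W, x, y)
```

For a linear head this loss is convex in the input, so the sign step cannot decrease it; in the actual HEXA setting the perturbation would be computed through the full network and the SSP objective rather than this toy head.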
Though adversarial examples in Figure 1c look visually indistinguishable from the original views in Figure 1b, their corresponding pseudo-labels can change significantly, depending on how much perturbation is added.<\/p>\n\n\n\n<p><strong>Cut-mixed examples. <\/strong>Patches are cut and pasted among images to create a new example, where the ground truth pseudo-labels are also mixed proportionally to the area of the patches. For example, by mixing a dog image and a cat image in Figure 1a, the cut-mixed examples are the \u201cdog-cat\u201d images shown in Figure 1d. One might imagine that cut-mixed examples are confusing for models (as they are for humans) since their contents are mixed from two sources.<\/p>\n\n\n\n<p>In addition to standard image transformations, we add adversarial and cut-mixed examples to the pretraining pipeline.<\/p>\n\n\n\n<h2 id=\"hexa-algorithms\">HEXA algorithms<\/h2>\n\n\n\n<p>With hard examples introduced as a new data augmentation scheme for SSP, we develop two novel algorithms to showcase how to construct them for different types of pretraining methods:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>HEXA<sub>MoCo<\/sub>. <\/strong>Based on MoCo-v2, the adversarial examples are produced for queries to flip the binary pseudo-label, and the cut-mixed examples are performed between queries (the binary labels are mixed proportionally as well), while keys are left unchanged.<\/li><li><strong>HEXA<sub>Dcluster<\/sub>.<\/strong> Based on DeepCluster-v2, the adversarial examples are produced for image views to fool the model into predicting a wrong cluster index, while the cut-mixed examples are performed between different views. 
(The cluster indices are mixed proportionally as well.)<\/li><\/ul>\n\n\n\n<p>Both algorithms share a common goal: harder examples lie closer to the decision boundary, providing stronger learning signals for the models, as illustrated in Figure 2 below.<\/p>\n\n\n\n<div class=\"wp-block-media-text has-vertical-margin-small  has-vertical-padding-none  is-stacked-on-mobile\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"189\" height=\"186\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/Figure2_Hexa-6036b1bcc68e1.png\" alt=\"\" class=\"wp-image-728542 size-full\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/Figure2_Hexa-6036b1bcc68e1.png 189w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/Figure2_Hexa-6036b1bcc68e1-12x12.png 12w\" sizes=\"auto, (max-width: 189px) 100vw, 189px\" \/><\/figure><div class=\"wp-block-media-text__content\">\n<p><sub>Figure 2: A visual illustration of the augmented view space. Each circle \u25cf indicates a transformation-augmented view. The adversarial example (triangle\u25b2) fools the SSP model into making a prediction mistake, and the cut-mixed example (square\u25fc) is created between two standard augmentations.<\/sub><\/p>\n<\/div><\/div>\n\n\n\n<h2 id=\"state-of-the-art-performance\">State-of-the-art performance<\/h2>\n\n\n\n<p>Our empirical study for SSP is performed on the ImageNet dataset. All experiments are conducted with ResNet-50 and pretrained for 200 or 800 epochs. 
The competitive state-of-the-art methods are collected from <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/facebookresearch\/moco\">MoCo-v2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/facebookresearch\/swav\">DeepCluster-v2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/facebookresearch\/swav\">SwAV<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/deepmind.com\/research\/publications\/-Bootstrap-Your-Own-Latent-A-new-approach-to-self-supervised-learning\">BYOL<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/hobbitlong.github.io\/InfoMin\/\">InfoMin<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/salesforce\/PCL\">PCL<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1912.01991\">PIRL<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab 
glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/google-research\/simclr\">SimCLR<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/deepmind\/deepmind-research\/tree\/master\/bigbigan\">BigBiGAN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1906.05849\">CMC<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1807.03748\">CPC<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1603.09246\">Jigsaw<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1805.01978\">Instance Discrimination<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p><strong>Linear evaluation. <\/strong>To evaluate learned representations, we first follow the widely used linear evaluation protocol, where a linear classifier is trained on top of the frozen base network and the validation accuracy is used to measure the quality of the representations. 
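As a toy sketch of this protocol (a fixed random projection stands in for the frozen pretrained backbone, and a synthetic two-class dataset stands in for labeled data; all names and sizes here are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "frozen backbone": a fixed random projection standing in for pretrained features
W_backbone = rng.normal(size=(16, 2))
def frozen_features(x):
    # backbone weights are never updated during linear evaluation
    return np.maximum(W_backbone @ x, 0.0)

# toy labeled data: two well-separated classes
X = np.concatenate([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

F = np.stack([frozen_features(x) for x in X])  # precompute frozen features

# linear probe: logistic regression trained on top of the frozen features
w = np.zeros(F.shape[1])
b = 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # predicted probabilities
    g = p - y                               # logistic-loss gradient signal
    w -= 0.1 * F.T @ g / len(y)
    b -= 0.1 * g.mean()

# validation-style accuracy of the probe measures representation quality
acc = (((F @ w + b) > 0).astype(int) == y).mean()
```

Only the probe's `w` and `b` are trained; a higher `acc` indicates that the frozen features are more linearly separable, which is the quantity the linear evaluation protocol reports.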
Table 1 shows the results.<\/p>\n\n\n\n<p>It is interesting that DeepCluster-v2 is slightly better than MoCo-v2, indicating that the traditional prototype methods can be as effective as the popular contrastive methods with the same pretraining epochs and data augmentation strategies. Notably, HEXA variants consistently outperform their contrastive and prototype baselines, showing that the proposed hard examples can effectively improve the learned visual representations in SSP.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"480\" height=\"389\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/HEXA-Tables1.png\" alt=\"a screenshot of a cell phone\" class=\"wp-image-728518\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/HEXA-Tables1.png 480w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/HEXA-Tables1-300x243.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/HEXA-Tables1-16x12.png 16w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/><figcaption>Table 1: HEXA achieves the best performance under the linear evaluation protocol compared with existing SSP methods.<\/figcaption><\/figure><\/div>\n\n\n\n<p><strong>Fine-tuning evaluation. <\/strong>We now fine-tune the pretrained model on downstream vision tasks, such as image classification. Depending on the availability of task-specific data, we use either semi-supervised learning or supervised learning for fine-tuning. Specifically, we select a subset (1% or 10%) or use the full set (100%) of ImageNet training data and fine-tune the entire model on these datasets. 
Besides self-supervised methods, we also compare our method with methods specifically designed for semi-supervised learning, including <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1704.03976\">VAT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1905.03670\">S4L<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1904.12848\">UDA<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2001.07685\">FixMatch<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p>Table 2 reports the Top-1 and Top-5 accuracy on the ImageNet validation set. HEXA improves on its counterparts, MoCo-v2 and DeepCluster-v2, in all cases. 
By fine-tuning for 20 epochs, HEXA reaches 78.6% Top-1 accuracy with the same ResNet-50 architecture, outperforming the supervised approach (76.5%) by a large margin of 2.1% absolute accuracy.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"523\" height=\"410\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/HEXA-Table-2.png\" alt=\"\" class=\"wp-image-728530\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/HEXA-Table-2.png 523w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/HEXA-Table-2-300x235.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/HEXA-Table-2-16x12.png 16w\" sizes=\"auto, (max-width: 523px) 100vw, 523px\" \/><figcaption>Table 2: HEXA achieves the best performance on both semi-supervised and fully supervised learning settings compared with existing SSP methods.<\/figcaption><\/figure><\/div>\n\n\n\n<h2 id=\"scaling-up-hexa-to-achieve-few-shot-learning\">Scaling up HEXA to achieve few-shot learning<\/h2>\n\n\n\n<p>With HEXA, we have seen the power of constructing hard examples as pretraining data in improving self-supervised image representation learning. One interesting direction of future work stems from HEXA showcasing that both adversarial and cut-mixed hard examples improve pretraining. This idea can be generalized to incorporate more types of hard examples, such as mix-up examples. Our results on fine-tuning the full ImageNet dataset also show that combining self-supervised learning and standard supervised learning is a very promising approach to improve the learned vision backbone.<\/p>\n\n\n\n<p>Another future research direction is to scale up this approach to billions of images, such as the scale in Big Transfer. 
Both SSP and OpenAI\u2019s CLIP contribute to the same goal: general visual representations with \u201cfew-shot\u201d or even \u201czero-shot\u201d learning capacities. The two methods are orthogonal: SSP conducts unsupervised learning on image-image pairs, while CLIP conducts supervised learning on image-text pairs that are freely available on the web. A better visual representation can be learned\u2014one that leverages the strengths of both. HEXA can be explored in this joint-learning setting to further improve performance.<\/p>\n\n\n\n<h3 id=\"acknowledgments\">Acknowledgments<\/h3>\n\n\n\n<p>This research was conducted by <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/chunyl\/\">Chunyuan Li<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xiul\/\">Xiujun Li<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/leizhang\/\">Lei Zhang<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/bapeng\/\">Baolin Peng<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/mingyuanzhou.github.io\/\">Mingyuan Zhou<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\">Jianfeng Gao<\/a>. Additional thanks go to the entire Philly team inside Microsoft who provided us with the computing platform to do this research. 
The implementation in our experiments depends on open-source GitHub repositories; we acknowledge all the authors who made their code public, which tremendously accelerates our project progress.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Humans perceive the world through observing a large number of visual scenes around us and then effectively generalizing\u2014in other words, interpreting and identifying scenes they haven\u2019t encountered before\u2014without heavily relying on labeled annotations for every single scene. One of the core aspirations in artificial intelligence is to develop algorithms and techniques that endow computers with [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":728755,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Chunyuan Li","user_id":"37971"},{"type":"user_nicename","value":"Lei Zhang","user_id":"32641"},{"type":"user_nicename","value":"Jianfeng 
Gao","user_id":"32246"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13562],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-728458","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144931,737755],"related-projects":[737098],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"display_name":"Jianfeng Gao","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\" aria-label=\"Visit the profile page for Jianfeng Gao\">Jianfeng Gao<\/a>","is_active":false,"last_first":"Gao, Jianfeng","people_section":0,"alias":"jfgao"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-960x540.jpg\" class=\"img-object-cover\" alt=\"a graphic shows computer vision model using example of dog image\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-1024x576.jpg 1024w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-16x9.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Hexa_Blog_still_No_logo-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Chunyuan Li, Lei Zhang, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jfgao\/\" title=\"Go to researcher profile for Jianfeng Gao\" aria-label=\"Go to researcher profile for Jianfeng Gao\" data-bi-type=\"byline author\" data-bi-cN=\"Jianfeng Gao\">Jianfeng Gao<\/a>","formattedDate":"February 25, 2021","formattedExcerpt":"Humans perceive the world through observing a large number of visual scenes around us and then effectively generalizing\u2014in other words, interpreting and identifying scenes they haven\u2019t encountered before\u2014without heavily relying on labeled annotations for every single scene. 
One of the core aspirations in artificial intelligence&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/728458","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=728458"}],"version-history":[{"count":13,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/728458\/revisions"}],"predecessor-version":[{"id":728752,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/728458\/revisions\/728752"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/728755"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=728458"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=728458"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=728458"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=728458"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=728458"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=728458"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www
.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=728458"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=728458"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=728458"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=728458"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=728458"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}