{"id":789503,"date":"2021-10-28T09:46:45","date_gmt":"2021-10-28T16:46:45","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=789503"},"modified":"2021-10-28T09:46:47","modified_gmt":"2021-10-28T16:46:47","slug":"acav100m-scaling-up-self-supervised-audio-visual-learning-with-automatically-curated-internet-videos","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/acav100m-scaling-up-self-supervised-audio-visual-learning-with-automatically-curated-internet-videos\/","title":{"rendered":"ACAV100M: Scaling up self-supervised audio-visual learning with automatically curated internet videos"},"content":{"rendered":"\n<p>The natural association between visual observations and their corresponding sounds has exhibited <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-and-nvidia-introduce-parameter-efficient-multimodal-transformers-for-video-representation-learning\/\">powerful self-supervision signals for learning video representations<\/a>, which makes the ever-growing amount of online video an attractive data source for self-supervised learning. However, online videos often provide imperfectly aligned audio-visual signals because of overdubbed audio; models trained on uncurated videos <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1911.12667\">have been shown to learn suboptimal representations<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> due to the misalignment issues. Therefore, existing approaches rely almost exclusively on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/deepmind.com\/research\/open-source\/kinetics\">manually curated datasets<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> with <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/research.google.com\/audioset\/\">a predetermined taxonomy of semantic concepts<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, where there is a high chance of audio-visual correspondence. 
This severely limits the utility of online videos for self-supervised learning, which begs the question: how can we fully leverage online videos without extensive human effort?

Today, in collaboration with Seoul National University and NVIDIA Research, we are thrilled to release an automatic dataset curation pipeline and the largest video dataset for self-supervised audio-visual learning, dubbed [ACAV100M](https://acav100m.github.io/) (/ˈärˌkīv/; **a**utomatically **c**urated **a**udio-**v**isual dataset). Our dataset is constructed from uncurated online videos at an unprecedented scale: we processed 140 million full-length videos (1,030 years of total duration, consuming 2 petabytes of storage) and reduced them to 100 million clips (10 seconds each, 31 years in total) that exhibit the highest audio-visual correspondence.
This is the largest video dataset in the current literature, orders of magnitude larger than existing datasets for audio-visual learning. We are also releasing [the code](https://github.com/sangho-vision/acav100m) so that the community can build upon our pipeline to create other datasets for self-supervised learning. We provide full details of the automatic dataset curation pipeline in our [ICCV 2021](http://iccv2021.thecvf.com/) paper, "[ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning](https://www.microsoft.com/en-us/research/publication/acav100m-automatic-curation-of-large-scale-datasets-for-audio-visual-video-representation-learning/)."

## Posing dataset creation as subset selection with maximum mutual information

We formulate a constrained optimization problem where the objective is to find a subset that maximizes the total mutual information between the audio and visual channels of the videos. Mutual information (MI) tells us how much knowledge of one variable reduces uncertainty about the other. A subset with maximum MI is therefore likely to contain a high proportion of videos with audio-visual correspondence, which is necessary for self-supervised learning.

One could estimate audio-visual MI per video, independently of the others, and create a subset that provides the maximum MI. This can be done, for example, via [contrastive learning](https://arxiv.org/abs/1807.03748) using co-occurring audio-visual signals as positive pairs. Unfortunately, we encounter a "chicken-and-egg" problem: contrastive learning requires a dataset from which we can reliably construct the positive pairs, but that is exactly what we set out to find in the first place! The literature already suggests that [models trained on uncurated videos suffer from suboptimal representations due to misaligned audio-visual signals](https://arxiv.org/abs/1911.12667).

In this work, rather than estimating MI at the instance level, we turn to the set-based MI estimation of [Banerjee et al. (2005)](https://jmlr.csail.mit.edu/papers/v6/banerjee05a.html), which quantifies the information shared by two clustering assignments of a dataset. Given a candidate subset, we cluster the videos with respect to their audio and visual signals, respectively, and measure the MI from a contingency table that encodes the overlap between the audio and visual clusters. We found this set-based approach to be more robust to real-world noise, which helps it create datasets with higher audio-visual correspondence than instance-based MI estimation.
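
To make the set-based objective concrete, here is a minimal sketch of how the MI between two clustering assignments can be computed from their contingency table. The function name and the NumPy implementation are our own illustration under the assumption of integer cluster labels, not the released pipeline; the paper describes the exact estimator we use.

```python
import numpy as np

def clustering_mi(audio_labels, visual_labels):
    """Mutual information (in nats) between audio and visual cluster assignments.

    audio_labels, visual_labels: integer cluster IDs, one per video clip.
    """
    audio_labels = np.asarray(audio_labels)
    visual_labels = np.asarray(visual_labels)
    n = len(audio_labels)

    # Contingency table: table[i, j] = number of clips assigned to
    # audio cluster i and visual cluster j.
    table = np.zeros((audio_labels.max() + 1, visual_labels.max() + 1))
    np.add.at(table, (audio_labels, visual_labels), 1)

    # Joint and marginal distributions over cluster pairs.
    p_joint = table / n
    p_audio = p_joint.sum(axis=1, keepdims=True)
    p_visual = p_joint.sum(axis=0, keepdims=True)

    # MI = sum over (i, j) of p(i, j) * log( p(i, j) / (p(i) * p(j)) ), skipping empty cells.
    nz = p_joint > 0
    return float((p_joint[nz] * np.log(p_joint[nz] / (p_audio @ p_visual)[nz])).sum())
```

For two label arrays like these, `sklearn.metrics.mutual_info_score` computes the same quantity; a subset whose audio and visual clusterings agree strongly scores high under this measure.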

Finding a global solution to our constrained optimization problem is NP-hard, so we implement a [greedy solution widely used in the submodular optimization literature](https://link.springer.com/article/10.1007/BF01588971). Specifically, we use the batch greedy algorithm of [Chen and Krause (2013)](http://proceedings.mlr.press/v28/chen13b.html), which randomly samples a batch from the remaining pool of candidates and searches for the next element to add to the solution only within that batch. This batch trick reduces the complexity from quadratic to linear in the population size. We implemented several other techniques to make the whole pipeline scale to hundreds of millions of video clips; we provide the full details [in our paper](https://www.microsoft.com/en-us/research/publication/acav100m-automatic-curation-of-large-scale-datasets-for-audio-visual-video-representation-learning/).
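
The following is a simplified sketch of this batch greedy loop, written against the hypothetical `clustering_mi` helper from the previous snippet. It illustrates the selection strategy only; the released pipeline adds feature extraction, clustering, sharding, and the other engineering needed at the scale of 100 million clips.

```python
import random
import numpy as np

def batch_greedy_select(candidates, audio_labels, visual_labels, budget, batch_size=1000):
    """Greedily grow a subset that maximizes clustering-based audio-visual MI.

    candidates: list of clip indices to choose from.
    audio_labels / visual_labels: precomputed cluster IDs, one per clip.
    budget: number of clips to select.
    """
    audio_labels = np.asarray(audio_labels)
    visual_labels = np.asarray(visual_labels)
    selected = []
    remaining = list(candidates)

    while len(selected) < budget and remaining:
        # Batch trick: score only a random batch of the remaining pool, which
        # reduces the cost from quadratic to linear in the size of the pool.
        batch = random.sample(remaining, min(batch_size, len(remaining)))

        # Pick the candidate whose addition yields the highest set-based MI.
        best = max(batch, key=lambda idx: clustering_mi(audio_labels[selected + [idx]],
                                                        visual_labels[selected + [idx]]))
        selected.append(best)
        remaining.remove(best)

    return selected
```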

## Evaluating the utility of datasets for self-supervised audio-visual learning

*Figure 1. Linear evaluation on downstream tasks. The top-1/5 accuracy (%) of video classification on UCF101, audio classification on ESC-50, and audio-visual classification on Kinetics-Sounds (KS). Results are grouped by downstream task and by the scale of the pretraining dataset. Baseline datasets are Kinetics-Sounds (20K), VGG-Sound (200K), and AudioSet (2M). All models are identical and trained using the same protocol except for the dataset used to pretrain them.*

To evaluate the utility of our dataset for self-supervised audio-visual learning, we compare it with popular video datasets widely used in the self-supervised learning literature. We pretrain identical models on different datasets via [contrastive learning](https://arxiv.org/abs/2002.05709) and perform linear evaluation on standard benchmarks. Figure 1 shows that models pretrained on our datasets ("clustering," green bars) perform as well as, and sometimes even outperform, models pretrained on existing datasets ("baselines," pink bars) that involve human annotation or manual verification. This shows that datasets created by our approach indeed provide the audio-visual correspondence necessary for self-supervised approaches. We provide in-depth analyses of the results [in our paper](https://arxiv.org/abs/2101.10803).
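
As a rough illustration of the pretraining objective behind this comparison, the snippet below shows a symmetric cross-modal InfoNCE loss of the kind commonly used for audio-visual contrastive learning, treating the audio and visual streams of the same clip as a positive pair. The function name and hyperparameters are illustrative assumptions; the exact encoders, augmentations, and training settings used in our experiments are specified in the paper.

```python
import torch
import torch.nn.functional as F

def audio_visual_nce_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of co-occurring audio-visual pairs.

    video_emb, audio_emb: (batch, dim) embeddings from the two encoders;
    row i of each tensor comes from the same clip (a positive pair), and
    every other row in the batch serves as a negative.
    """
    video_emb = F.normalize(video_emb, dim=1)
    audio_emb = F.normalize(audio_emb, dim=1)

    # Pairwise similarities between every video and every audio clip in the batch.
    logits = video_emb @ audio_emb.t() / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)

    # Cross-entropy in both directions: video-to-audio and audio-to-video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```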

## Looking forward

Large-scale datasets are the cornerstone of self-supervised representation learning. We hope that the video understanding community will find value in our [ACAV100M](https://acav100m.github.io/) dataset, which we release in the form of public URLs so anyone can access the data as long as the videos are still available online. We are also open-sourcing [the dataset creation pipeline](https://github.com/sangho-vision/acav100m), which can be applied to other scenarios involving different modalities, such as vision and language. We look forward to seeing rapid development in learning from unlabeled videos with our new dataset and tools.