{"id":1016412,"date":"2024-03-19T15:53:17","date_gmt":"2024-03-19T22:53:17","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=1016412"},"modified":"2024-03-26T10:25:37","modified_gmt":"2024-03-26T17:25:37","slug":"training-audio-captioning-models-without-audio","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/training-audio-captioning-models-without-audio\/","title":{"rendered":"Training Audio Captioning Models without Audio"},"content":{"rendered":"<p><span dir=\"ltr\" role=\"presentation\">Automated Audio Captioning (AAC) is the task of generating <\/span><span dir=\"ltr\" role=\"presentation\">natural language descriptions given an audio stream. A typ<\/span><span dir=\"ltr\" role=\"presentation\">ical AAC system requires manually curated training data of <\/span><span dir=\"ltr\" role=\"presentation\">audio segments and corresponding text caption annotations. <\/span><span dir=\"ltr\" role=\"presentation\">The creation of these audio-caption pairs is costly, resulting <\/span><span dir=\"ltr\" role=\"presentation\">in general data scarcity for the task. In this work, we address <\/span><span dir=\"ltr\" role=\"presentation\">this major limitation and propose an approach to train AAC <\/span><span dir=\"ltr\" role=\"presentation\">systems using only text. Our\u00a0 approach leverages the multi-<\/span><span dir=\"ltr\" role=\"presentation\">modal space of contrastively trained audio-text models, such <\/span><span dir=\"ltr\" role=\"presentation\">as CLAP. During training, a decoder generates captions con<\/span><span dir=\"ltr\" role=\"presentation\">ditioned on the pretrained CLAP text encoder. During infer<\/span><span dir=\"ltr\" role=\"presentation\">ence, the text encoder is replaced\u00a0 with the pretrained CLAP <\/span><span dir=\"ltr\" role=\"presentation\">audio encoder. To bridge the modality gap between text and <\/span><span dir=\"ltr\" role=\"presentation\">audio embeddings, we propose the use of noise injection or <\/span><span dir=\"ltr\" role=\"presentation\">a learnable adapter, during training.<\/span> <span dir=\"ltr\" role=\"presentation\">We find that the pro<\/span><span dir=\"ltr\" role=\"presentation\">posed text-only framework\u00a0 performs competitively with state-<\/span><span dir=\"ltr\" role=\"presentation\">of-the-art models trained with paired audio, showing that effi<\/span><span dir=\"ltr\" role=\"presentation\">cient text-to-audio transfer is possible. Finally, we showcase <\/span><span dir=\"ltr\" role=\"presentation\">both stylized audio captioning and caption enrichment while <\/span><span dir=\"ltr\" role=\"presentation\">training without audio or human-created text captions.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-1016430\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4-1024x346.png\" alt=\"graphical user interface, application\" width=\"1024\" height=\"346\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4-1024x346.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4-300x101.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4-768x259.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4-1536x519.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4-2048x691.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4-240x81.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p><span dir=\"ltr\" role=\"presentation\">The first panel depicts the modality gap between CLAP pretrained audio and pretrained text embeddings in the joint audio-text <\/span><span dir=\"ltr\" role=\"presentation\">space. The second panel shows the proposed method of text-only training for Automated Audio Captioning. At inference, the text encoder is <\/span><span dir=\"ltr\" role=\"presentation\">swapped with the audio encoder and a caption is produced for the input audio. Only mapping network<\/span> <span dir=\"ltr\" role=\"presentation\">m<\/span> <span dir=\"ltr\" role=\"presentation\">is trainable, while modules with<\/span> <span dir=\"ltr\" role=\"presentation\">(snowflake) <\/span><span dir=\"ltr\" role=\"presentation\">are frozen. The Prefix is the output of<\/span> <span dir=\"ltr\" role=\"presentation\">m<\/span><span dir=\"ltr\" role=\"presentation\">. Singular arrows depict embedding vectors while multiple arrows indicate a sequence of vectors.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-1016439\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4_results_table-1024x279.png\" alt=\"table\" width=\"1024\" height=\"279\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4_results_table-1024x279.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4_results_table-300x82.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4_results_table-768x209.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4_results_table-1536x418.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4_results_table-2048x557.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/noaudiocap4_results_table-240x65.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p><span dir=\"ltr\" role=\"presentation\">The Table shows results of various models trained on both Au<\/span><span dir=\"ltr\" role=\"presentation\">dioCaps and Clotho. Models in rows 1-4 use both audio and <\/span><span dir=\"ltr\" role=\"presentation\">text in training. The proposed text-only model (row 5) uses <\/span><span dir=\"ltr\" role=\"presentation\">only text data and random Gaussian noise with a std of 0.015. <\/span><span dir=\"ltr\" role=\"presentation\">It achieves comparable performance with the best audio cap<\/span><span dir=\"ltr\" role=\"presentation\">tioning models in the literature and obtains a SPIDEr score of\u00a0 0.256 on Clotho and 0.455 on AudioCaps, higher than 0.215 and 0.437 reported by Kim et. al.\u00a0<\/span><\/p>\n<p><span dir=\"ltr\" role=\"presentation\">Text-only training is a valid alternative to training and\/or <\/span><span dir=\"ltr\" role=\"presentation\">initializing audio captioning systems. We also train our model <\/span><span dir=\"ltr\" role=\"presentation\">architecture made for text-only training with audio-text pairs. <\/span><span dir=\"ltr\" role=\"presentation\">The architecture is similar to Fig 1, where during training we <\/span><span dir=\"ltr\" role=\"presentation\">use audio files with an audio encoder instead of text with a <\/span><span dir=\"ltr\" role=\"presentation\">text encoder and Gaussian noise. This is the last and grayed <\/span><span dir=\"ltr\" role=\"presentation\">row in the Table above. The difference in SPIDEr score between the <\/span><span dir=\"ltr\" role=\"presentation\">audio-text and the text-only training is small: +0.02 on Au<\/span><span dir=\"ltr\" role=\"presentation\">dioCaps and +0.01 on Clotho.<\/span> <span dir=\"ltr\" role=\"presentation\">This indicates that our text-<\/span><span dir=\"ltr\" role=\"presentation\">only training can achieve comparable results without audio <\/span><span dir=\"ltr\" role=\"presentation\">data. The main benefit of text-only training is training on un<\/span><span dir=\"ltr\" role=\"presentation\">paired openly available text. We explore this in Section 5.1, <\/span><span dir=\"ltr\" role=\"presentation\">whereby using LLM-generated text, we show that text-only <\/span><span dir=\"ltr\" role=\"presentation\">training can improve over the audio-text training.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2024-4-1","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[243062,13555],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[247741],"msr-conference":[259657],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1016412","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-audio-acoustics","msr-research-area-search-information-retrieval","msr-locale-en_us","msr-field-of-study-audio-signal-processing"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2024-4-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/TRAINING-AUDIO-CAPTIONING-MODELS-WITHOUT-AUDIO_2309.07372.pdf","id":"1016424","title":"training-audio-captioning-models-without-audio_2309-07372","label_id":"243132","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":1016424,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/03\/TRAINING-AUDIO-CAPTIONING-MODELS-WITHOUT-AUDIO_2309.07372.pdf"}],"msr-author-ordering":[{"type":"user_nicename","value":"Soham Deshmukh","user_id":40312,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Soham Deshmukh"},{"type":"user_nicename","value":"Benjamin Elizalde","user_id":41662,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Benjamin Elizalde"},{"type":"user_nicename","value":"Dimitra Emmanouilidou","user_id":37461,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Dimitra Emmanouilidou"},{"type":"text","value":"Bhiksha Raj","user_id":0,"rest_url":false},{"type":"text","value":"Rita Singh","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Huaming Wang","user_id":32052,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Huaming Wang"}],"msr_impact_theme":[],"msr_research_lab":[199565],"msr_event":[],"msr_group":[144923],"msr_project":[],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1016412","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":6,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1016412\/revisions"}],"predecessor-version":[{"id":1018218,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1016412\/revisions\/1018218"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1016412"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1016412"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1016412"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1016412"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=1016412"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1016412"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1016412"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1016412"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1016412"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1016412"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1016412"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1016412"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1016412"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}