{"id":1002270,"date":"2024-01-26T17:13:54","date_gmt":"2024-01-27T01:13:54","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=1002270"},"modified":"2024-01-26T17:13:54","modified_gmt":"2024-01-27T01:13:54","slug":"advanced-long-content-speech-recognition-with-factorized-neural-transducer","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/advanced-long-content-speech-recognition-with-factorized-neural-transducer\/","title":{"rendered":"Advanced Long-Content Speech Recognition with Factorized Neural Transducer"},"content":{"rendered":"<p>Long-content automatic speech recognition (ASR) has obtained increasing interest in recent years, as it captures the relationship among consecutive historical utterances while decoding the current utterance.<br \/>\nIn this paper, we propose two novel approaches, which integrate {long-content} information into the factorized neural transducer~(FNT) based architecture in both non-streaming~(referred to as LongFNT) and streaming~(referred to as SLongFNT) scenarios.<br \/>\nWe first investigate whether {long-content} transcriptions can improve the vanilla conformer transducer~(C-T) models.<br \/>\nOur experiments indicate that the vanilla C-T models do not exhibit improved performance when utilizing {long-content} transcriptions, possibly due to the predictor network of C-T models not functioning as a pure language model.<br \/>\nInstead, FNT shows its potential in utilizing {long-content} information, where we propose the LongFNT model and explore the impact of {long-content} information in both text~(LongFNT-Text) and speech~(LongFNT-Speech).<br \/>\nThe proposed LongFNT-Text and LongFNT-Speech models further complement each other to achieve better performance, with transcription history proving more valuable to the model.<br \/>\nThe effectiveness of our LongFNT approach is evaluated on LibriSpeech and GigaSpeech corpora, and obtains relative 19\\% and 12\\% word error rate reduction, respectively.<br \/>\nFurthermore, we extend the LongFNT model to the streaming scenario, which is named SLongFNT, consisting of SLongFNT-Text and SLongFNT-Speech approaches to utilize {long-content} text and speech information.<br \/>\nExperiments show that the proposed SLongFNT model achieves relative 26\\% and 17\\% WER reduction on LibriSpeech and GigaSpeech respectively while keeping a good latency, compared to the FNT baseline.<br \/>\nOverall, our proposed LongFNT and SLongFNT highlight the significance of considering {long-content} speech and transcription knowledge for improving both non-streaming and streaming speech recognition systems.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Long-content automatic speech recognition (ASR) has obtained increasing interest in recent years, as it captures the relationship among consecutive historical utterances while decoding the current utterance. In this paper, we propose two novel approaches, which integrate {long-content} information into the factorized neural transducer~(FNT) based architecture in both non-streaming~(referred to as LongFNT) and streaming~(referred to as [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"IEEE\/ACM Transactions on Audio, Speech, and Language Processing","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2024-1","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13545],"msr-publication-type":[193715],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1002270","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2024-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"IEEE\/ACM Transactions on Audio, Speech, and Language Processing","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/TASLP24__Advanced_Long_Content_Speech_Recognition_with_Factorized_Neural_Transducer.pdf","id":"1002273","title":"taslp24__advanced_long_content_speech_recognition_with_factorized_neural_transducer","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":1002273,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/01\/TASLP24__Advanced_Long_Content_Speech_Recognition_with_Factorized_Neural_Transducer.pdf"}],"msr-author-ordering":[{"type":"text","value":"Xun Gong","user_id":0,"rest_url":false},{"type":"text","value":"Yu Wu","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Jinyu Li","user_id":32312,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Jinyu Li"},{"type":"user_nicename","value":"Shujie Liu","user_id":33634,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Shujie Liu"},{"type":"text","value":"Rui Zhao","user_id":0,"rest_url":false},{"type":"text","value":"Xie Chen","user_id":0,"rest_url":false},{"type":"text","value":"Yanmin Qian","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[],"msr_project":[],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"article","related_content":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1002270","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1002270\/revisions"}],"predecessor-version":[{"id":1002276,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1002270\/revisions\/1002276"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1002270"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1002270"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1002270"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1002270"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=1002270"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1002270"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1002270"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1002270"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1002270"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1002270"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1002270"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1002270"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1002270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}