{"id":420618,"date":"2017-08-20T17:58:40","date_gmt":"2017-08-21T00:58:40","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=420618"},"modified":"2018-06-13T08:02:39","modified_gmt":"2018-06-13T15:02:39","slug":"microsoft-researchers-achieve-new-conversational-speech-recognition-milestone","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone\/","title":{"rendered":"Microsoft researchers achieve new conversational speech recognition milestone"},"content":{"rendered":"<p>Last year, Microsoft\u2019s speech and dialog research group <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/blogs.microsoft.com\/ai\/2016\/10\/18\/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition\/\">announced<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> a milestone in reaching human parity on the Switchboard conversational speech recognition task, meaning we had created technology that recognized words in a conversation as well as professional human transcribers.<\/p>\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/windows\/cortana\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright wp-image-420774 size-medium\" title=\"Cortana Translator\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Cortana-Chat-300x300.jpg\" alt=\"Cortana Chat\" width=\"300\" height=\"300\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Cortana-Chat-300x300.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Cortana-Chat-150x150.jpg 150w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Cortana-Chat-180x180.jpg 180w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Cortana-Chat-360x360.jpg 360w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Cortana-Chat.jpg 471w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>After our transcription system reached the 5.9 percent word error rate that we had measured for humans, other researchers conducted their own study, employing a more involved multi-transcriber process, which yielded a human word error rate of 5.1 percent. This was consistent with prior research that showed that humans achieve higher levels of agreement on the precise words spoken as they expend more care and effort. Today, I\u2019m excited to announce that our research team reached that 5.1 percent error rate with our speech recognition system, a new industry milestone, substantially surpassing the accuracy we achieved last year. A <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1708.06073\" target=\"_blank\" rel=\"noopener noreferrer\">technical report<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> published this weekend documents the details of our system.<\/p>\n<p>Switchboard is a corpus of recorded telephone conversations that the speech research community has used for more than 20 years to benchmark speech recognition systems. The task involves transcribing conversations between strangers discussing topics such as sports and politics.<\/p>\n<p>We reduced our error rate by about 12 percent compared to last year\u2019s accuracy level, using a series of improvements to our neural net-based acoustic and language models. We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long short-term memory) model for improved acoustic modeling. 
Additionally, we now combine predictions from multiple acoustic models at both the frame\/senone and word levels.<\/p>\n<p>Moreover, we strengthened the recognizer\u2019s language model by using the entire history of a\u00a0dialog session to predict what is likely to come next, effectively allowing the model to adapt to the topic and local context of a conversation.<\/p>\n<p>Our team has also benefited greatly from using the most scalable deep learning software available, <a href=\"https:\/\/www.microsoft.com\/en-us\/cognitive-toolkit\/\">Microsoft Cognitive Toolkit 2.1<\/a> (CNTK), for exploring model architectures and optimizing the hyper-parameters of our models. Additionally, Microsoft\u2019s investment in cloud compute infrastructure, specifically <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/azure.microsoft.com\/en-us\/resources\/videos\/azurecon-2015-applications-that-scale-using-gpu-compute\/\">Azure GPUs<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, helped to improve the effectiveness and speed with which we could train our models\u00a0and test new ideas.<\/p>\n<p>Reaching accuracy on par with humans has been a research goal for the last 25 years.\u00a0Microsoft\u2019s willingness to invest in long-term research is now paying dividends for our customers in products and services such as <a href=\"https:\/\/www.microsoft.com\/en-us\/windows\/cortana\">Cortana<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/download\/details.aspx?id=55024\">Presentation Translator<\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/\">Microsoft Cognitive Services<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
It\u2019s deeply gratifying to our research teams to see our work used by millions of people each day.<\/p>\n<div id=\"attachment_420777\" style=\"width: 730px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.microsoft.com\/en-us\/download\/details.aspx?id=55024\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-420777\" class=\"wp-image-420777 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Garage-Screenshot-2.jpg\" alt=\"Presentation Translator\" width=\"720\" height=\"425\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Garage-Screenshot-2.jpg 720w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Garage-Screenshot-2-300x177.jpg 300w\" sizes=\"auto, (max-width: 720px) 100vw, 720px\" \/><\/a><p id=\"caption-attachment-420777\" class=\"wp-caption-text\">Advances in speech recognition have created services such as <a href=\"https:\/\/www.microsoft.com\/en-us\/download\/details.aspx?id=55024\">Presentation Translator<\/a>, which can translate presentations in real time for multilingual audiences.<\/p><\/div>\n<p>Many research groups in industry and academia are doing great work in speech recognition, and our own work has greatly benefited from the community\u2019s overall progress.\u00a0While achieving a 5.1 percent word error rate on the Switchboard speech recognition task is a significant achievement, the speech research community still has many challenges to address, such as achieving human levels of recognition in noisy environments with distant microphones, and recognizing accented speech, or speaking styles and languages for which only limited training data is available. Moreover, we have much work to do in teaching computers not just to transcribe the words spoken, but also to understand their meaning and intent. 
Moving from recognizing to understanding speech is the next major frontier for speech technology.<\/p>\n<p><strong>Related:<\/strong><\/p>\n<ul>\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/microsoft-2017-conversational-speech-recognition-system\/\">Technical Report: The Microsoft 2017 Conversational Speech Recognition System<\/a><\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/blogs.microsoft.com\/ai\/2016\/10\/18\/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition\/\">Microsoft researchers reach human parity in conversational speech recognition<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/blogs.microsoft.com\/next\/2016\/09\/13\/microsoft-researchers-achieve-speech-recognition-milestone\/#sm.0000v8bqc0rcdflis6g1wekfl455r\">Microsoft researchers achieve speech recognition milestone<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/news.microsoft.com\/features\/speak-hear-talk-the-long-quest-for-technology-that-understands-speech-as-well-as-a-human\/#sm.0000v8bqc0rcdflis6g1wekfl455r\">Speak, hear talk: The quest to create technology that understands speech as well as a human<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/blogs.microsoft.com\/jobs\/worlds-apart-try-coming-together\/\">Worlds apart? 
Try\u00a0coming together<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/x.com\/xdh\">Follow Xuedong Huang on Twitter<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Last year, Microsoft\u2019s speech and dialog research group announced a milestone in reaching human parity on the Switchboard conversational speech recognition task, meaning we had created technology that recognized words in a conversation as well as professional human transcribers. After our transcription system reached the 5.9 percent word error rate that we had measured for [&hellip;]<\/p>\n","protected":false},"author":35981,"featured_media":420651,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Xuedong 
Huang","user_id":"34869"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[194462],"tags":[],"research-area":[13545],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-420618","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-speech-and-dialog","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"600\" height=\"256\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Speech-Soundwave-655x255-Feature.jpg\" class=\"img-object-cover\" alt=\"Image of soundwaves\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Speech-Soundwave-655x255-Feature.jpg 600w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/08\/Speech-Soundwave-655x255-Feature-300x128.jpg 300w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/>","byline":"Xuedong Huang","formattedDate":"August 20, 2017","formattedExcerpt":"Last year, Microsoft\u2019s speech and dialog research group announced a milestone in reaching human parity on the Switchboard conversational speech recognition task, meaning we had created technology that recognized words in a conversation as well as professional human transcribers. 
After our transcription system reached the&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/420618","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/35981"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=420618"}],"version-history":[{"count":19,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/420618\/revisions"}],"predecessor-version":[{"id":491048,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/420618\/revisions\/491048"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/420651"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=420618"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=420618"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=420618"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=420618"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=420618"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=420618"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.c
om\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=420618"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=420618"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=420618"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=420618"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=420618"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}