{"id":804160,"date":"2021-12-17T15:22:31","date_gmt":"2021-12-17T23:22:31","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=804160"},"modified":"2021-12-20T07:18:37","modified_gmt":"2021-12-20T15:18:37","slug":"azure-ai-milestone-new-neural-text-to-speech-models-more-closely-mirror-natural-speech","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/azure-ai-milestone-new-neural-text-to-speech-models-more-closely-mirror-natural-speech\/","title":{"rendered":"Azure AI milestone: New Neural Text-to-Speech models more closely mirror natural speech"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-1024x576.jpg\" alt=\"diagram\" class=\"wp-image-806419\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-scaled-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/1400x788_NeuralTTS_no_logo_hero-1-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><em>Neural Text-to-Speech\u2014along with recent milestones in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/azure-ai-milestone-new-foundation-model-florence-v1-0-pushing-vision-and-vision-language-state-of-the-art\/\">computer vision<\/a> and question answering\u2014is part of a larger <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/azure.microsoft.com\/en-us\/overview\/ai-platform\/\">Azure AI<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> mission to provide relevant, meaningful AI solutions and services that work better for people because they better capture how people learn and work\u2014with improved vision, knowledge understanding, and speech capabilities. 
[Neural Text-to-Speech](https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/) (Neural TTS), a powerful speech synthesis capability of [Azure Cognitive Services](https://azure.microsoft.com/en-us/services/cognitive-services/), enables developers to convert text to lifelike speech. It is used in voice assistant scenarios, content read-aloud capabilities, accessibility tools, and more. Neural TTS has now reached a significant milestone in Azure with a new generation of Neural TTS model, Uni-TTSv4, whose quality shows no significant difference from sentence-level recordings of natural speech.

Microsoft debuted the original technology three years ago with [close to human-parity](https://azure.microsoft.com/en-us/blog/microsoft-s-new-neural-text-to-speech-service-helps-machines-speak-like-people/) quality, producing TTS audio that was more fluid, natural sounding, and better articulated. Since then, Neural TTS has been incorporated into Microsoft flagship products such as [Edge Read Aloud](https://www.youtube.com/watch?v=a9FN11y9qEQ), [Immersive Reader](https://www.onenote.com/learningtools), and [Word Read Aloud](https://insider.office.com/en-us/blog/new-natural-sounding-voices-come-to-read-aloud).
It has also been adopted by many customers, including [AT&T](https://www.youtube.com/watch?v=MkeI7Aaf7hk), [Duolingo](https://blogs.microsoft.com/ai-for-business/custom-neural-voice-ga/), [Progressive](https://news.microsoft.com/transform/progressive-gives-voice-to-flos-chatbot-and-its-as-no-nonsense-and-reassuring-as-she-is/), and more. Users can choose from multiple pre-set voices, or record and upload their own samples to create custom voices. Over 110 languages are supported, including a wide array of language variants, also known as locales.

The latest version of the model, Uni-TTSv4, is now shipping into production for a first set of eight voices (shown in the table below). We will continue to roll out the new model architecture to the remaining 110-plus languages and to [Custom Neural Voice](https://speech.microsoft.com/customvoice) in coming milestones. Our users will automatically get significantly better-quality TTS through the [Azure TTS API](https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/), Microsoft Office, and the Edge browser.

## Measuring TTS quality

Text-to-speech quality is measured by the Mean Opinion Score (MOS), a widely recognized scoring method for speech quality evaluation. In MOS studies, participants rate speech characteristics of both recordings of people's voices and TTS voices on a five-point scale. These characteristics include sound quality, pronunciation, speaking rate, and articulation. For any model improvement, we first conduct a side-by-side comparative MOS test ([CMOS](https://techcommunity.microsoft.com/t5/azure-ai/azure-neural-tts-upgraded-with-hifinet-achieving-higher-audio/ba-p/1847860)) against the production models.
Then, we run a blind MOS test on a held-out set of recordings (recordings not used in training) and on the TTS-synthesized audio, and we measure the difference between the two MOS scores.

During research on the new model, Microsoft submitted the Uni-TTSv4 system to [Blizzard Challenge 2021](https://www.synsig.org/index.php/Blizzard_Challenge_2021) under its code name, DelightfulTTS. Our paper, "[DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021](https://www.microsoft.com/en-us/research/publication/delightfultts-the-microsoft-speech-synthesis-system-for-blizzard-challenge-2021/)," provides in-depth detail on our research and the results. The Blizzard Challenge is a well-known TTS benchmark organized by world-class experts in the TTS field, and it conducts large-scale MOS tests on multiple TTS systems with hundreds of listeners. Results from Blizzard Challenge 2021 demonstrate that the voice built with the new model shows no significant difference from natural speech on the common dataset.
## Measurement results for Uni-TTSv4 and comparison

The MOS scores below are based on samples produced by the Uni-TTSv4 model under the constraints of real-time performance requirements.

A [Wilcoxon signed-rank test](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) was used to determine whether the MOS scores differed significantly between the held-out recordings and TTS. A p-value of 0.05 or less indicates a statistically significant difference, while a p-value above 0.05 indicates a difference that is not statistically significant. A positive CMOS number indicates a gain over the production model, meaning that listeners judged the new voice as more natural.
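Before turning to the results table, here is a minimal sketch of how such a paired comparison could be run with SciPy's implementation of the Wilcoxon signed-rank test. The scores below are synthetic stand-ins generated for illustration, not the study's actual per-utterance ratings.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-utterance MOS means for the same 50 utterances,
# rated once as held-out human recordings and once as TTS output.
rng = np.random.default_rng(seed=0)
recording_mos = np.clip(rng.normal(4.33, 0.30, size=50), 1, 5)
tts_mos = np.clip(recording_mos - rng.normal(0.04, 0.15, size=50), 1, 5)

# Paired signed-rank test: do the two score distributions differ significantly?
stat, p_value = wilcoxon(recording_mos, tts_mos)
print(f"MOS recording = {recording_mos.mean():.2f}, MOS TTS = {tts_mos.mean():.2f}")
print(f"Wilcoxon p-value = {p_value:.3f}")  # p > 0.05 => no significant difference
```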
| Locale (voice) | Human recording (MOS) | Uni-TTSv4 (MOS) | Wilcoxon p-value | CMOS vs. production |
|---|---|---|---|---|
| en-US (Jenny) | 4.33 (±0.04) | 4.29 (±0.04) | 0.266 | +0.116 |
| en-US (Sara) | 4.16 (±0.05) | 4.12 (±0.05) | 0.41 | +0.129 |
| zh-CN (Xiaoxiao) | 4.54 (±0.05) | 4.51 (±0.05) | 0.44 | +0.181 |
| it-IT (Elsa) | 4.59 (±0.04) | 4.58 (±0.03) | 0.34 | +0.25 |
| ja-JP (Nanami) | 4.44 (±0.04) | 4.37 (±0.05) | 0.053 | +0.19 |
| ko-KR (Sun-hi) | 4.24 (±0.06) | 4.15 (±0.06) | 0.11 | +0.097 |
| es-ES (Alvaro) | 4.36 (±0.05) | 4.33 (±0.04) | 0.312 | +0.18 |
| es-MX (Dalia) | 4.45 (±0.05) | 4.39 (±0.05) | 0.103 | +0.076 |

## A comparison of human and Uni-TTSv4 audio samples

Listen to the recording and TTS samples below to hear the quality of the new model. Note that the recordings are not part of the training set.

These voices have been updated to the new model in the Azure TTS online service. You can also [try the demo](https://azure.microsoft.com/services/cognitive-services/text-to-speech/#features) with your own text. More voices will be upgraded to Uni-TTSv4 later.

**en-US (Jenny):** "The visualizations of the vocal quality continue in a quartet and octet."
- [Human recording](https://www.microsoft.com/en-us/research/wp-content/uploads/2021/12/Jenny_NonTTS-recording.wav)
- [Uni-TTSv4](https://www.microsoft.com/en-us/research/wp-content/uploads/2021/12/Jenny_TTS_new.wav)

**en-US (Sara):** "Like other visitors, he is a believer."
- [Human recording](https://www.microsoft.com/en-us/research/wp-content/uploads/2021/12/Sara-NonTTS-recording.wav)
- [Uni-TTSv4](https://www.microsoft.com/en-us/research/wp-content/uploads/2021/12/Sara-TTS-new.wav)
**zh-CN (Xiaoxiao):** "另外,也要规避当前的地缘局势风险,等待合适的时机介入。" ("In addition, we should also avoid the current geopolitical risks and wait for the right moment to step in.")
- [Human recording](https://www.microsoft.com/en-us/research/wp-content/uploads/2021/12/Xiaoxiao-NonTTS-RECORDING.wav)
- [Uni-TTSv4](https://www.microsoft.com/en-us/research/wp-content/uploads/2021/12/Xiaoxiao-TTS-NEW-Wave.wav)

**it-IT (Elsa):** "La riunione del Consiglio di Federazione era prevista per ieri." ("The Federation Council meeting was scheduled for yesterday.")
- [Human recording](https://www.microsoft.com/en-us/research/wp-content/uploads/2021/12/Elsa-NonTTS-recording.wav)
- [Uni-TTSv4](https://www.microsoft.com/en-us/research/wp-content/uploads/2021/12/Elsa-TTS-new.wav)

**ja-JP (Nanami):** "責任はどうなるのでしょうか?" ("What happens to the responsibility?")
- [Human recording](https://www.microsoft.com/en-us/research/wp-content/uploads/2021/12/Nanami-NonTTS_recording.wav)
- [Uni-TTSv4](https://www.microsoft.com/en-us/research/wp-content/uploads/2021/12/Nanami-TTS_new.wav)
aligncenter\"><audio controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/Sunhi-NonTTS-recording.wav\"><\/audio><\/figure>\n\n\n\n<p><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-top is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\"><strong>Uni-TTSv4<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-audio aligncenter\"><audio controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/Sunhi-TTS-new.wav\"><\/audio><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\"><strong>Es-ES (Alvaro)<\/strong><\/p>\n\n\n\n<p class=\"has-text-align-center\">Al parecer, se trata de una operaci\u00f3n vinculada con el tr\u00e1fico de drogas.<\/p>\n\n\n\n<p><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-top is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\"><strong>Human recording<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-audio aligncenter\"><audio controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/Alvaro-NonTTS-recording.wav\"><\/audio><\/figure>\n\n\n\n<p><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-top is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\"><strong>Uni-TTSv4<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-audio aligncenter\"><audio controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/Alvaro-TTS-new.wav\"><\/audio><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\"><strong>Es-MX (Dalia)<\/strong><\/p>\n\n\n\n<p class=\"has-text-align-center\">Haber desempe\u00f1ado el papel de Primera Dama no es una tarea sencilla.<\/p>\n\n\n\n<p><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-top is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\"><strong>Human recording<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-audio aligncenter\"><audio controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/Dalia-NonTTS-recording.wav\"><\/audio><\/figure>\n\n\n\n<p><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-top is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\"><strong>Uni-TTSv4<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-audio aligncenter\"><audio controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/12\/Dalia-TTS-new.wav\"><\/audio><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"how-uni-ttsv4-works-to-better-represent-human-speech\">How Uni-TTSv4 works to better represent human speech<\/h2>\n\n\n\n<p>Over the past 3 years, Microsoft has been improving its engine to make TTS that more closely aligns with human speech. 
While the quality of typical Neural TTS synthesized speech has been impressive, its perceived quality and naturalness still have room to improve compared with human speech recordings. We found this is particularly the case when people listen to TTS for an extended period. It is in the very subtle nuances, such as variations in tone or pitch, that people can tell whether speech is generated by AI.

Why is it so hard for a TTS voice to reflect human vocal expression more closely? Human speech is usually rich and dynamic: with different emotions and in different contexts, a word is spoken differently, and in many languages this difference can be very subtle. The expression of a TTS voice is modeled with various acoustic parameters, and it is currently not very efficient for those parameters to capture all the coarse-grained and fine-grained details of the acoustic spectrum of human speech. TTS is also a typical one-to-many mapping problem: there can be multiple valid speech outputs (varying, for example, in pitch, duration, speaker, prosody, and style) for a given text input. Modeling such variation information is therefore important for improving the expressiveness and naturalness of synthesized speech.

To achieve these improvements in quality and naturalness, Uni-TTSv4 introduces two significant updates in acoustic modeling. First, there is a new architecture with transformer and convolution blocks, which better models the local and global dependencies in the acoustic model; in general, transformers learn global interactions while convolutions efficiently capture local correlations. Second, we model variation information systematically from both explicit perspectives (speaker ID, language ID, pitch, and duration) and implicit perspectives (utterance-level and phoneme-level prosody), using supervised and unsupervised learning respectively, which ensures end-to-end audio naturalness and expressiveness. This method achieves a good balance between model performance and controllability, as illustrated below:

*Figure 1: Acoustic model and vocoder diagram of Uni-TTSv4. Text input is first encoded by the text encoder; implicit and explicit information is then added to the hidden embeddings from the text encoder; a spectrum decoder uses the result to predict the mel-spectrogram; and finally the vocoder converts the mel-spectrogram into audio samples.*
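As a reading aid, the data flow in Figure 1 can be mirrored in a few lines of PyTorch. Every module here is a deliberately tiny, hypothetical stand-in named after a block in the diagram; the real Uni-TTSv4 components are far richer.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the blocks in Figure 1 (illustrative only).
dim, n_mels = 384, 80
text_encoder = nn.Embedding(100, dim)      # phoneme IDs -> hidden embeddings
spectrum_decoder = nn.Linear(dim, n_mels)  # hidden states -> mel-spectrogram frames
vocoder = nn.Linear(n_mels, 256)           # mel frame -> waveform samples (HiFiNet stand-in)

def synthesize(phoneme_ids):
    hidden = text_encoder(phoneme_ids)     # (time, dim)
    # ... the variance adaptor would add explicit (speaker, language, pitch,
    # duration) and implicit (prosody) information to `hidden` here ...
    mel = spectrum_decoder(hidden)         # (time, n_mels)
    return vocoder(mel).reshape(-1)        # flatten frames to a 1-D waveform

audio = synthesize(torch.tensor([3, 14, 15, 9, 26]))
print(audio.shape)
```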
To achieve better voice quality, the basic modeling block needs fundamental improvement. Global and local interactions are especially important for non-autoregressive TTS: its decoder output sequence is longer than in machine translation or speech recognition, and each frame in the decoder cannot see its own history the way it can in an autoregressive model. So we designed a new modeling block that combines the best of the transformer and the convolution, where self-attention learns the global interaction while convolutions efficiently capture the local correlations.

*Figure 2: The improved conformer module. The first layer is a convolutional feed-forward layer; the second is a depth-wise convolutional layer; the third is a self-attention layer; and the last is another convolutional feed-forward layer. Every sub-layer is followed by an add-and-norm (a residual connection plus layer norm).*
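The structure in Figure 2 translates naturally into a small PyTorch module. This is a minimal sketch of the described block with hypothetical dimensions, not the production implementation.

```python
import torch
import torch.nn as nn

class ConvFeedForward(nn.Module):
    """Convolutional feed-forward sub-layer: 1-D conv, activation, 1-D conv."""
    def __init__(self, dim, hidden_dim, kernel_size=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, dim, kernel_size, padding=kernel_size // 2),
        )
    def forward(self, x):  # x: (batch, time, dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

class ImprovedConformerBlock(nn.Module):
    """Conv feed-forward -> depth-wise conv -> self-attention -> conv feed-forward,
    each sub-layer wrapped with a residual connection and a layer norm."""
    def __init__(self, dim=384, heads=4, ff_hidden=1536, kernel_size=3):
        super().__init__()
        self.ff1 = ConvFeedForward(dim, ff_hidden, kernel_size)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff2 = ConvFeedForward(dim, ff_hidden, kernel_size)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, x):  # x: (batch, time, dim)
        x = self.norms[0](x + self.ff1(x))                   # conv feed-forward
        x = self.norms[1](x + self.depthwise(x.transpose(1, 2)).transpose(1, 2))
        attn_out, _ = self.attn(x, x, x)                     # global self-attention
        x = self.norms[2](x + attn_out)
        x = self.norms[3](x + self.ff2(x))                   # conv feed-forward
        return x

# Quick shape check: 2 utterances, 100 frames, 384-dim hidden states.
block = ImprovedConformerBlock()
print(block(torch.randn(2, 100, 384)).shape)
```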
The new variance adaptor, based on [FastSpeech2](https://www.microsoft.com/en-us/research/publication/fastspeech-2-fast-and-high-quality-end-to-end-text-to-speech/), introduces a hierarchical implicit-information modeling pipeline covering utterance-level and phoneme-level prosody, together with explicit information such as duration, pitch, speaker ID, and language ID. Modeling these variations effectively mitigates the one-to-many mapping problem and improves the expressiveness and fidelity of synthesized speech.

*Figure 3: Variance adaptor with explicit and implicit variation information modeling. First, the explicit speaker ID and language ID, along with pitch information, are added to the hidden embeddings from the text encoder via lookup tables. Then, implicit utterance-level and phoneme-level prosody vectors are predicted from the text hidden states. Finally, the hidden representation is expanded with the predicted duration.*
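Below is a compact, hypothetical PyTorch sketch of the flow in Figure 3: lookup tables for the explicit speaker and language IDs, stand-in predictors for prosody and pitch, and a length regulator that expands each phoneme by its predicted duration. Plain linear layers stand in for the real convolutional predictors, and training-time details (such as using ground-truth pitch and duration) are omitted.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

class VarianceAdaptor(nn.Module):
    """Illustrative sketch of the variance adaptor flow, not the real model."""
    def __init__(self, dim=384, n_speakers=8, n_languages=8):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, dim)   # explicit speaker ID
        self.language_emb = nn.Embedding(n_languages, dim) # explicit language ID
        self.utt_prosody = nn.Linear(dim, dim)  # utterance-level prosody predictor
        self.phn_prosody = nn.Linear(dim, dim)  # phoneme-level prosody predictor
        self.pitch = nn.Linear(dim, 1)          # pitch predictor
        self.pitch_emb = nn.Linear(1, dim)
        self.duration = nn.Linear(dim, 1)       # log-duration predictor

    def forward(self, h, speaker_id, language_id):  # h: (batch, phonemes, dim)
        h = h + self.speaker_emb(speaker_id)[:, None] + self.language_emb(language_id)[:, None]
        h = h + self.utt_prosody(h.mean(dim=1, keepdim=True))  # one vector per utterance
        h = h + self.phn_prosody(h)                            # one vector per phoneme
        h = h + self.pitch_emb(self.pitch(h))                  # add predicted pitch back in
        dur = torch.clamp(torch.round(torch.exp(self.duration(h))).long().squeeze(-1), min=1)
        # Length regulator: repeat each phoneme hidden state by its predicted duration.
        frames = [h[b].repeat_interleave(dur[b], dim=0) for b in range(h.size(0))]
        return pad_sequence(frames, batch_first=True)

adaptor = VarianceAdaptor()
out = adaptor(torch.randn(2, 10, 384), torch.tensor([0, 1]), torch.tensor([0, 0]))
print(out.shape)  # (2, max_total_frames, 384), ready for the spectrum decoder
```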
We use our previously proposed [HiFiNet](https://techcommunity.microsoft.com/t5/azure-ai-blog/azure-neural-tts-upgraded-with-hifinet-achieving-higher-audio/ba-p/1847860), a new generation of Neural TTS vocoder, to convert the spectrum into audio samples.

For more details on the above system, refer to the paper "[DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021](https://www.microsoft.com/en-us/research/publication/delightfultts-the-microsoft-speech-synthesis-system-for-blizzard-challenge-2021/)".

## Working to advance AI with XYZ-code in a responsible way

We are excited about the future of Neural TTS with human-centric, natural-sounding quality under the [XYZ-code](https://www.microsoft.com/en-us/research/blog/a-holistic-representation-toward-integrative-ai/) AI framework. Microsoft is committed to the advancement and use of AI grounded in principles that put people first and benefit society. We are putting these [Microsoft AI principles](https://www.microsoft.com/en-us/ai/responsible-ai) into practice throughout the company and strongly encourage developers to do the same.
For guidance on deploying AI responsibly, visit [Responsible use of AI with Cognitive Services](https://docs.microsoft.com/en-us/azure/cognitive-services/responsible-use-of-ai-overview).

### Get started with Neural TTS in Azure

Neural TTS in Azure offers [over 270 neural voices across over 110 languages and locales](https://docs.microsoft.com/azure/cognitive-services/speech-service/language-support#standard-voices). In addition, the capability enables organizations to create a unique brand voice in multiple languages and styles. To explore the capabilities of Neural TTS with some of its different voice offerings, try the [demo](https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/#features), or start from the code sketch that follows the list below.

**For more information:**

- Read our [documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/index-text-to-speech).
- Check out our [sample code](https://github.com/Azure-Samples/cognitive-services-speech-sdk#speech-synthesis-quickstarts).
- Check out the [code of conduct](https://docs.microsoft.com/en-us/legal/cognitive-services/speech-service/tts-code-of-conduct?context=/azure/cognitive-services/speech-service/context/context) for integrating Neural TTS into your apps.
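As a starting point, the snippet below sketches how one of the upgraded voices can be requested through the Speech SDK for Python (`pip install azure-cognitiveservices-speech`). The subscription key and region are placeholders to replace with your own Speech resource values.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholders: use your own Speech resource key and region.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
# One of the eight voices upgraded to Uni-TTSv4 in this milestone.
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# With no audio config specified, audio plays through the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello! This voice is powered by Azure Neural TTS.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print(f"Synthesized {len(result.audio_data)} bytes of audio.")
```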
### Acknowledgments

The research behind Uni-TTSv4 was conducted by a team of researchers from across Microsoft, including Yanqing Liu, Zhihang Xu, Xu Tan, Bohan Li, Xiaoqiang Wang, Songze Wu, Jie Ding, Peter Pan, Cheng Wen, Gang Wang, Runnan Li, Jin Wu, Jinzhu Li, Xi Wang, Yan Deng, Jingzhou Yang, Lei He, Sheng Zhao, Tao Qin, Tie-Yan Liu, Frank Soong, Li Jiang, and Xuedong Huang, with support from all the Azure Speech and Cognitive Services team members, the Integrated Training Platform team, and the [ONNX Runtime](https://onnxruntime.ai/) team, who made this great accomplishment possible.