{"id":1004841,"date":"2024-02-06T09:53:20","date_gmt":"2024-02-06T17:53:20","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=1004841"},"modified":"2024-07-04T11:11:20","modified_gmt":"2024-07-04T18:11:20","slug":"elate","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/elate\/","title":{"rendered":"ELaTE"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background-catalina-blue card-background--full-bleed\">\n\t\t\t\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading\" id=\"elate\">ELaTE  <\/h1>\n\n\n\n<p>Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n\n\n<p><strong>ELaTE<\/strong> is a zero-shot text-to-speech (TTS) system that can generate <strong>natural laughing speech from any speaker<\/strong> based on a speaker prompt to mimic the voice characteristic, a text prompt to indicate the contents of the generated speech, and an input to control the laughter expression.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/arxiv.org\/abs\/2402.07383\" target=\"_blank\" 
rel=\"noreferrer noopener\">Read the paper<\/a><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p>ELaTE has the following key features:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Precise control of laughter timing: <\/strong>A user can specify the timing of laughter, which critically affects the nuance of the generated speech. <\/li>\n\n\n\n<li><strong>Precise control of laughter expression: <\/strong>A user can guide the laughter expression using an example audio clip containing laughter. <\/li>\n\n\n\n<li><strong>Built upon a well-trained zero-shot TTS: <\/strong>ELaTE can generate natural speech without compromising audio quality and with a negligible increase in computational cost compared to the conventional zero-shot TTS model.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-text-align-center\">Laughing speech samples generated by ELaTE<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_1_B.wav\"><\/audio>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_last_3_A.wav\"><\/audio>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_2_B.wav\"><\/audio>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow 
wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"526\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v2-65c5946676871-1024x526.png\" alt=\"Overview of ELaTE\" class=\"wp-image-1006047\" style=\"width:554px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v2-65c5946676871-1024x526.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v2-65c5946676871-300x154.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v2-65c5946676871-768x394.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v2-65c5946676871-1536x788.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v2-65c5946676871-2048x1051.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v2-65c5946676871-240x123.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center is-style-default\" id=\"speech-to-speech-translation\">  Speech-to-speech translation<\/h2>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"200\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v3-1024x200.png\" alt=\"Application for Speech-to-Speech 
Translation\" class=\"wp-image-1006530\" style=\"width:565px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v3-1024x200.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v3-300x59.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v3-768x150.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v3-240x47.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/main-for-web-page-v3.png 1404w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p>ELaTE can be applied to speech-to-speech translation, <strong>transferring not only the voice characteristic but also the precise nuance of the source audio<\/strong> with&nbsp;unprecedented quality.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-left\">Original speech (Chinese)<\/p>\n\n\n\n<audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_source_1_C.wav\"><\/audio>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-left\">Generated speech (English)<\/p>\n\n\n\n<audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_1_C.wav\"><\/audio>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center is-style-default\" 
id=\"model-overview\">Model overview<\/h2>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p>We develop ELaTE based on the foundation of conditional flow-matching-based zero-shot TTS, and fine-tune it with frame-level representation from a laughter detector as additional conditioning. With a simple scheme to mix small-scale laughter-conditioned data with large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS model can be readily fine-tuned to generate natural laughter with precise controllability, without losing any quality of the pre-trained zero-shot TTS model.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"476\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/Overview-v9-1024x476.png\" alt=\"Model Overview\" class=\"wp-image-1007478\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/Overview-v9-1024x476.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/Overview-v9-300x140.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/Overview-v9-768x357.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/Overview-v9-1536x714.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/Overview-v9-240x112.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/Overview-v9.png 1920w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading 
has-text-align-center is-style-default\" id=\"audio-samples\">Audio samples<\/h2>\n\n\n\n<p>Below, we include audio samples demonstrating how ELaTE performs with various laughter instructions. The speech samples were taken from the LibriSpeech test-clean and DiariST-AliMeeting datasets. They are provided for the sole purpose of illustrating ELaTE.<\/p>\n\n\n\n<div style=\"width: 100%;margin: 0 auto\">\n    <!-- Instruction by time -->\n    <div style=\"margin-bottom: 50px\">\n        <h3 style=\"text-align: left\">Instruction by time<\/h3>\n        <p style=\"text-align: left\">ELaTE synthesizes speech in the voice characteristic specified by a speaker prompt, adding laughter at the specified timing. <\/p>\n        <div style=\"border-bottom: 2px solid black;margin-bottom: 2px\"><\/div>\n        <div style=\"background-color: #E6E6FA;padding: 20px;border-radius: 5px;max-width: 80%;margin: 20px auto\">\n            <table style=\"width: 100%;border-collapse: collapse;border: none\">\n                <thead>\n                    <tr style=\"border-bottom: 2px solid black\">\n                        <th style=\"text-align: center;padding: 8px;width: 25%\">Text prompt<\/th>\n                        <th style=\"text-align: center;padding: 8px;width: 25%\">Speaker prompt<\/th>\n                        <th style=\"text-align: center;padding: 8px;width: 25%\">Laughter prompt<\/th>\n                        <th style=\"text-align: center;padding: 8px;width: 25%\">Generated speech<\/th>\n                    <\/tr>\n                <\/thead>\n                <tbody>\n            <tr>\n                <td style=\"text-align: left;padding: 8px;border-bottom: 1px solid #ccc\" rowspan=\"3\">That&#8217;s funny!<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\" rowspan=\"3\">\n                    <audio controls style=\"width: 150px\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_prompt_1_A.wav\"><\/audio>\n                <\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">0.0&#8211;1.4 sec (first half of speech)<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_first_1_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">1.4&#8211;2.8 sec (last half of speech)<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_last_thats-funny-last-1.4-2.8.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">No laughter<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_last_1_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: left;padding: 8px;border-bottom: 1px solid #ccc\" rowspan=\"3\">I didn&#8217;t see that one coming!<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\" rowspan=\"3\">\n                    <audio controls style=\"width: 150px\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_prompt_2_A.wav\"><\/audio>\n                <\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">0.0&#8211;2.0 sec (first half of speech)<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_first_2_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">2.0&#8211;4.0 sec (last half of speech)<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_last_2_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">No laughter<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_nolaugh_2_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: left;padding: 8px;border-bottom: 1px solid #ccc\" rowspan=\"3\">I&#8217;m not sure whether to laugh or cry!<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\" rowspan=\"3\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_prompt_3_A.wav\"><\/audio>\n              
  <\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">0.0&#8211;2.2 sec (first half of speech)<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_first_3_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">2.2&#8211;4.4 sec (last half of speech)<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_last_3_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">No laughter<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_nolaugh_3_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: left;padding: 8px;border-bottom: 1px solid #ccc\" rowspan=\"3\">You&#8217;ve got to be kidding me!<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\" rowspan=\"3\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_prompt_4_A.wav\"><\/audio>\n                <\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">0.0&#8211;1.8 sec (first half of 
speech)<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_first_4_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">1.8&#8211;3.6 sec (last half of speech)<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_last_4_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">No laughter<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_nolaugh_4_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: left;padding: 8px;border-bottom: 1px solid #ccc\" rowspan=\"3\">Who let the dogs out?<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\" rowspan=\"3\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_prompt_5_A.wav\"><\/audio>\n                <\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">0.0&#8211;1.6 sec (first half of speech)<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls 
style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_first_5_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">1.6&#8211;3.2 sec (last half of speech)<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_last_5_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n            <tr>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">No laughter<\/td>\n                <td style=\"text-align: center;padding: 8px;border-bottom: 1px solid #ccc\">\n                    <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_nolaugh_5_A.wav\"><\/audio>\n                <\/td>\n            <\/tr>\n                <\/tbody>\n            <\/table>\n        <\/div>\n    <\/div>\n    <\/div>\n    <!-- Instruction by example -->\n    <div style=\"margin-bottom: 50px\">\n    <h3 style=\"text-align: left\">Instruction by example<\/h3>\n    <p style=\"text-align: left\">ELaTE synthesizes speech in the voice characteristic specified by a speaker prompt and incorporates the laughter style specified by a laughter prompt. 
<\/p>\n        <div style=\"border-bottom: 2px solid black;margin-bottom: 2px\"><\/div>\n        <div style=\"background-color: #E6E6FA;padding: 20px;border-radius: 5px;max-width: 80%;margin: 20px auto\">\n            <table style=\"width: 100%;border-collapse: collapse;border: none\">\n                <thead>\n                    <tr style=\"border-bottom: 2px solid black\">\n                        <th style=\"text-align: center;padding: 8px;width: 25%\">Text prompt<\/th>\n                        <th style=\"text-align: center;padding: 8px;width: 25%\">Speaker prompt<\/th>\n                        <th style=\"text-align: center;padding: 8px;width: 25%\">Laughter prompt<\/th>\n                        <th style=\"text-align: center;padding: 8px;width: 25%\">Generated speech<\/th>\n                    <\/tr>\n                <\/thead>\n                <tbody>\n                <tr>\n                    <td style=\"text-align: left;padding: 8px\">That&#8217;s funny!<\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_speakerprompt_1_B.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laughterprompt_1_B.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_1_B.wav\"><\/audio>\n                    <\/td>\n                <\/tr>\n                <tr>\n                    <td style=\"text-align: left;padding: 8px\">That&#8217;s what she said!<\/td>\n                    
<td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_speakerprompt_2_B.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laughterprompt_2_B.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_2_B.wav\"><\/audio>\n                    <\/td>\n                <\/tr>\n                <tr>\n                    <td style=\"text-align: left;padding: 8px\">I&#8217;ve heard of air guitar, but this is ridiculous! <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_speakerprompt_3_B.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laughterprompt_3_B.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_3_B.wav\"><\/audio>\n                    <\/td>\n                <\/tr>\n                <tr>\n                    <td style=\"text-align: left;padding: 8px\">Well, that&#8217;s a plot 
twist!<\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_speakerprompt_4_B.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laughterprompt_4_B.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_4_B.wav\"><\/audio>\n                    <\/td>\n                <\/tr>\n                <tr>\n                    <td style=\"text-align: left;padding: 8px\">I guess that&#8217;s one way to do it!<\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_speakerprompt_5_B.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laughterprompt_5_B.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_5_B.wav\"><\/audio>\n                    <\/td>\n                <\/tr>\n                <\/tbody>\n            <\/table>\n        <\/div>\n    <\/div>\n    <!-- Application for 
speech-to-speech translation -->\n    <div style=\"margin-bottom: 50px\">\n       <h3 style=\"text-align: left\">Application for speech-to-speech translation<\/h3>\n    <p style=\"text-align: left\">ELaTE can be applied to speech-to-speech translation, transferring not only the voice characteristic but also the precise nuance of the source audio.<\/p>\n        <div style=\"border-bottom: 2px solid black;margin-bottom: 2px\"><\/div>\n        <div style=\"background-color: #E6E6FA;padding: 20px;border-radius: 5px;max-width: 80%;margin: 20px auto\">\n            <table style=\"width: 100%;border-collapse: collapse;border: none\">\n                <thead>\n                <tr style=\"border-bottom: 2px solid black\">\n                    <th style=\"text-align: center;padding: 8px\" rowspan=\"2\">Source audio (Chinese)<\/th>\n                    <th style=\"text-align: center;padding: 8px\" colspan=\"3\">Translated audio (English)<\/th>\n                <\/tr>\n                <tr style=\"border-bottom: 2px solid black\">\n                    <th style=\"text-align: center;padding: 8px\">Seamless Expressive<\/th>\n                    <th style=\"text-align: center;padding: 8px\">Our baseline TTS<\/th>\n                    <th style=\"text-align: center;padding: 8px\">ELaTE<\/th>\n                <\/tr>\n                <\/thead>\n                <tbody>\n                <tr>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_source_1_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_seamless_1_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: 
center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_baseline_1_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_1_C.wav\"><\/audio>\n                    <\/td>\n                <\/tr>\n                <tr>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_source_4_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_seamless_4_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_baseline_4_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_4_C.wav\"><\/audio>\n                    <\/td>\n                <\/tr>\n                <tr>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_source_3_C.wav\"><\/audio>\n                    <\/td>\n  
                  <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_seamless_3_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_baseline_3_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_3_C.wav\"><\/audio>\n                    <\/td>\n                <\/tr>\n                <tr>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_source_5_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_seamless_5_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_baseline_5_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_5_C.wav\"><\/audio>\n                    <\/td>\n    
            <\/tr>\n                <tr>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_source_6_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_seamless_6_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_baseline_6_C.wav\"><\/audio>\n                    <\/td>\n                    <td style=\"text-align: center;padding: 8px\">\n                        <audio controls style=\"width: 150px\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/\/elate_laugh_6_C.wav\"><\/audio>\n                    <\/td>\n                <\/tr>\n                <\/tbody>\n            <\/table>\n        <\/div>\n    <\/div>\n    <!-- End of Application for Speech-to-Speech Translation -->\n\n<\/div>\n\n\n\n<p><em><sup>(*) The list of DiariST-AliMeeting laughter utterances we used for our evaluation, along with their transcription and translation, can be downloaded from<\/sup><\/em> <sup><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/DiariST-AliMeeting-Laughter-Test-Set.txt\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/DiariST-AliMeeting-Laughter-Test-Set.txt<\/a><\/sup> <em><sup>under CC BY-SA 4.0 <span style=\"font-size: 13.0591px\">lic<\/span>ense.<\/sup><\/em><br><em><sup>(**) We used Seamless Expressive for a pure research 
purpose. Seamless Expressive was used under the Seamless Licensing Agreement. Copyright \u00a9 Meta Platforms, Inc. All Rights Reserved.<\/sup><\/em><\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center is-style-default\" id=\"ethics-statement-1\">Ethics statement<\/h2>\n\n\n\n<p>ELaTE can synthesize speech that maintains speaker identity and could be used for education, entertainment, journalism, self-authored content, accessibility features, interactive voice response systems, translation, chatbots, and similar applications. While ELaTE can speak in a voice like that of the voice talent, the similarity and naturalness depend on the length and quality of the speech prompt, the background noise, and other factors. The model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should be accompanied by a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model.<\/p>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like ELaTE is a zero-shot text-to-speech (TTS) system that can generate natural laughing speech from any speaker based on a speaker prompt to mimic the voice characteristic, a text prompt to indicate the contents of the generated speech, and an input to control the laughter expression. 
ELaTE has [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13545],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1004841","msr-project","type-msr-project","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Xiaofei Wang","user_id":38658,"people_section":"Related people","alias":"xiaofewa"},{"type":"user_nicename","display_name":"Manthan Thakker","user_id":39627,"people_section":"Related people","alias":"mathakke"},{"type":"guest","display_name":"Canrun Li","user_id":583888,"people_section":"Related people","alias":""},{"type":"guest","display_name":"Chung-Hsien Tsai","user_id":1053960,"people_section":"Related people","alias":""},{"type":"guest","display_name":"Zhen Xiao","user_id":583885,"people_section":"Related people","alias":""},{"type":"guest","display_name":"Yanqing Liu","user_id":794366,"people_section":"Related people","alias":""},{"type":"user_nicename","display_name":"Sheng Zhao","user_id":41137,"people_section":"Related 
people","alias":"szhao"}],"msr_research_lab":[199565],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1004841","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":173,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1004841\/revisions"}],"predecessor-version":[{"id":1053966,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1004841\/revisions\/1053966"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1004841"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1004841"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1004841"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1004841"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1004841"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}