{"id":953901,"date":"2023-07-10T12:10:56","date_gmt":"2023-07-10T19:10:56","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=953901"},"modified":"2026-02-05T06:41:47","modified_gmt":"2026-02-05T14:41:47","slug":"project-rumi","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/project-rumi\/","title":{"rendered":"Project Rumi"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background- card-background--full-bleed\">\n\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"720\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/project-rumi-header_color_1920x720.png\" class=\"attachment-full size-full\" alt=\"Project Rumi header - multimodal diagram on gradient background\" style=\"object-position: 51% 44%\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/project-rumi-header_color_1920x720.png 1920w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/project-rumi-header_color_1920x720-300x113.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/project-rumi-header_color_1920x720-1024x384.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/project-rumi-header_color_1920x720-768x288.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/project-rumi-header_color_1920x720-1536x576.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/project-rumi-header_color_1920x720-1600x600.png 1600w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/project-rumi-header_color_1920x720-240x90.png 240w\" sizes=\"auto, (max-width: 1920px) 100vw, 1920px\" \/>\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div 
class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading\" id=\"project-rumi\">Project Rumi<\/h1>\n\n\n\n<p>Multimodal paralinguistic prompting for large language models<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<h2 class=\"wp-block-heading\" id=\"multimodal-paralinguistic-prompting-for-large-language-models\">Multimodal paralinguistic prompting for large language models<\/h2>\n\n\n\n<p>Large language models (LLMs) are powerful neural networks that generate natural language with demonstrated value across a wide range of domains. They are revolutionizing many aspects of human society and culture, including introducing a \u201cwhole new interaction model between humans and computers, turning natural language into the most powerful productivity tool on the planet.\u201d<sup>1<\/sup><\/p>\n\n\n\n<p>However, LLMs also have limitations: they may not always understand the context and nuances of a conversation. Their performance also depends on the quality and specificity of the user\u2019s input, or prompt. The data that the user inputs into the LLM is a lexical entry, which does not comprehensively represent the nuances of human-to-human interaction; it is in fact missing all the <strong>paralinguistic<\/strong> information (intonation, gestures, facial expressions, and everything besides the actual words) that contributes to the meaning and intentions of the speaker. This can lead to misinterpretation, misunderstanding, or inappropriate responses from the LLM. 
Project Rumi incorporates paralinguistic input into prompt-based interactions with LLMs with the objective of improving the quality of communication. Providing this context is critical to enhancing LLMs\u2019 capabilities in this \u201cAI as a copilot\u201d era.<\/p>\n\n\n\n<p>Our current system leverages separately trained vision and audio-based models to detect and analyze non-verbal cues extracted from data streams. The models assess sentiment from cognitive and physiological data in real time, generating appropriate paralinguistic tokens to augment standard lexical prompt input to existing LLMs such as GPT-4. This multimodal, multi-step architecture integrates seamlessly with all pretrained text-based LLMs to provide additional information on the user&#8217;s sentiment and intention that is not captured by text-based models, augmenting the prompt with the richness and subtlety of human communication to bring human-AI interaction to a new level.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"1033\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/Rumi-system-diagram_UPDATED_July31_1400.png\" alt=\"Diagram illustrating user interaction with application. Inputs user\u2019s sensor data such as video and audio and uses processed sensor data to generate paralinguistic classification. 
This classification augments the input to the LLM, generating an augmented output to application.\" class=\"wp-image-958200\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/Rumi-system-diagram_UPDATED_July31_1400.png 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/Rumi-system-diagram_UPDATED_July31_1400-300x221.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/Rumi-system-diagram_UPDATED_July31_1400-1024x756.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/Rumi-system-diagram_UPDATED_July31_1400-768x567.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/Rumi-system-diagram_UPDATED_July31_1400-80x60.png 80w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/07\/Rumi-system-diagram_UPDATED_July31_1400-240x177.png 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Diagram illustrating user interaction with application. Inputs user\u2019s sensor data such as video and audio and uses processed sensor data to generate paralinguistic classification. This classification augments the input to the LLM, generating an augmented output to application.<\/figcaption><\/figure>\n\n\n\n<div style=\"height:30px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Future explorations include improving performance of existing models and incorporating additional signals like HRV (heart rate variability) derived from standard video, and cognitive and ambient sensing. 
Conveying unspoken meaning and intention is an essential component in the next generation of AI interaction.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Multimodal Paralinguistic Prompting for Large Language Models (Project Rumi) Demo\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/hKnOsfKrKbE?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/worklab\/ai-a-whole-new-way-of-working\" target=\"_blank\" rel=\"noreferrer noopener\">Explore AI: A whole new way of working<\/a><\/div>\n<\/div>\n\n\n","protected":false},"excerpt":{"rendered":"<p>Project Rumi incorporates paralinguistic input into prompt-based interactions with LLMs with the objective of improving the quality of communication. 
Providing this context is critical to enhancing LLMs\u2019 capabilities in this \u201cAI as a copilot\u201d era.<\/p>\n","protected":false},"featured_media":954165,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13562,13554],"msr-locale":[268875],"msr-impact-theme":[264846,261667],"msr-pillar":[],"class_list":["post-953901","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-research-area-human-computer-interaction","msr-locale-en_us","msr-archive-status-complete"],"msr_project_start":"","related-publications":[958476],"related-downloads":[],"related-videos":[],"related-groups":[901101],"related-events":[],"related-opportunities":[],"related-posts":[957600],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Weiwei Yang","user_id":40138,"people_section":"Section name 0","alias":"weiwya"},{"type":"user_nicename","display_name":"Spencer Fowers","user_id":33581,"people_section":"Section name 0","alias":"sfowers"},{"type":"user_nicename","display_name":"David Tittsworth","user_id":38064,"people_section":"Section name 0","alias":"datittsw"},{"type":"user_nicename","display_name":"Amber Hoak","user_id":37992,"people_section":"Section name 0","alias":"amhoak"},{"type":"user_nicename","display_name":"Thiago Vallin Spina","user_id":42246,"people_section":"Section name 0","alias":"tvallinspina"},{"type":"user_nicename","display_name":"Kate Lytvynets","user_id":38073,"people_section":"Section name 0","alias":"kalytv"},{"type":"user_nicename","display_name":"Christopher O'Dowd","user_id":42762,"people_section":"Section name 0","alias":"chriod"},{"type":"user_nicename","display_name":"Andrea Britto Mattos Lima","user_id":42393,"people_section":"Section 
name 0","alias":"andreabri"},{"type":"user_nicename","display_name":"Whitney Hudson","user_id":42513,"people_section":"Section name 0","alias":"whhudson"},{"type":"guest","display_name":"Prachi Patel","user_id":962415,"people_section":"Section name 0","alias":""},{"type":"guest","display_name":"Robert Morain","user_id":962418,"people_section":"Section name 0","alias":""}],"msr_research_lab":[199565,1161007],"msr_impact_theme":["Computing foundations","Empowerment"],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/953901","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":17,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/953901\/revisions"}],"predecessor-version":[{"id":1161534,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/953901\/revisions\/1161534"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/954165"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=953901"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=953901"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=953901"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=953901"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=953901"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templ
ated":true}]}}