{"id":604773,"date":"2019-08-29T10:41:21","date_gmt":"2019-08-29T17:41:21","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=604773"},"modified":"2019-08-29T10:54:20","modified_gmt":"2019-08-29T17:54:20","slug":"microsoft-icecaps-an-open-source-toolkit-for-conversation-modeling","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-icecaps-an-open-source-toolkit-for-conversation-modeling\/","title":{"rendered":"Microsoft Icecaps: An open-source toolkit for conversation modeling"},"content":{"rendered":"<p>How we act, including how we speak, is more often than not determined by the situation we find ourselves in. We wouldn\u2019t necessarily use the same tone and language with friends during a night out bowling as we would with colleagues during an office meeting. We tailor dialogue to appropriately fit the scenario. If trained conversational agents are to continue evolving into dependable resources people can turn to for assistance, they\u2019ll need to be trained to do the same.<\/p>\n<p>Today, we\u2019re excited to make available the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/microsoft-icecaps\/\">Intelligent Conversation Engine: Code and Pre-trained Systems, or Microsoft I<span style=\"font-variant: small-caps;\">cecaps<\/span><\/a>, a new open-source toolkit that not only allows researchers and developers to imbue their chatbots with different personas, but also to incorporate other natural language processing features that emphasize conversation modeling.<\/p>\n<p>I<span style=\"font-variant: small-caps;\">cecaps<\/span> provides an array of capabilities from recent conversation modeling literature. 
Several of these tools were driven by recent work done here at Microsoft Research, including personalization embeddings, maximum mutual information\u2013based decoding, knowledge grounding, and an approach for enforcing more structure on shared feature representations to encourage more diverse and relevant responses. Our library leverages TensorFlow in a modular framework designed to make it easy for users to construct sophisticated training configurations using multi-task learning. In the coming months, we\u2019ll equip I<span style=\"font-variant: small-caps;\">cecaps<\/span> with pre-trained conversational models that researchers and developers can either use directly out of the box or quickly adapt to new scenarios by bootstrapping their own systems.<\/p>\n<h3>Multi-task learning and SpaceFusion<\/h3>\n<p>At I<span style=\"font-variant: small-caps;\">cecaps<\/span>\u2019 core is a flexible multi-task learning paradigm. In multi-task learning, a subset of parameters is shared among multiple tasks so those tasks can make use of shared feature representations. For example, this technique has been used in conversational modeling to <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/multi-task-learning-speaker-role-based-neural-conversation-models\/\">combine general conversational data with unpaired utterances<\/a>; by pairing a conversational model with an autoencoder that shares its decoder, one can use the unpaired data to personalize the conversational model. I<span style=\"font-variant: small-caps;\">cecaps<\/span> enables multi-task learning by representing most models as chains of components and allowing researchers and developers to build arbitrarily complex configurations of models with shared components. 
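As a rough sketch of this chained, multi-task setup, two tasks can share a single decoder component while a phase schedule shifts how often each task is sampled over training. All class and function names below are hypothetical stand-ins for illustration, not Icecaps APIs:

```python
# Minimal sketch of multi-task training with a shared component and a
# task-weighting schedule. All names here are illustrative, not Icecaps APIs.
import random

class SharedDecoder:
    # Stands in for a decoder whose parameters are shared across tasks.
    def __init__(self):
        self.updates = 0
    def step(self):
        self.updates += 1

class Task:
    def __init__(self, name, decoder):
        self.name = name
        self.decoder = decoder
    def train_step(self):
        # Both tasks update the same shared decoder parameters.
        self.decoder.step()

def train(tasks, weights_by_phase, steps_per_phase, seed=0):
    # weights_by_phase lets the task mixture change over the course of
    # training, e.g. start conversation-heavy, finish autoencoder-heavy.
    rng = random.Random(seed)
    counts = {t.name: 0 for t in tasks}
    for weights in weights_by_phase:
        for _ in range(steps_per_phase):
            task = rng.choices(tasks, weights=weights)[0]
            task.train_step()
            counts[task.name] += 1
    return counts

decoder = SharedDecoder()
tasks = [Task('seq2seq', decoder), Task('autoencoder', decoder)]
counts = train(tasks, weights_by_phase=[(3, 1), (1, 3)], steps_per_phase=100)
```

Because both tasks hold a reference to the same decoder object, every training step, whichever task it comes from, moves the same shared parameters; this mirrors how a conversational model and an autoencoder can share a decoder.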
Flexible multi-task training schedules are also supported, allowing users to alter how tasks are weighted over the course of training.<\/p>\n<div id=\"attachment_606045\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/ICECAPS_Social_08_2019_1200x627_final-e1567034017909.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-606045\" class=\"wp-image-606045 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/ICECAPS_Social_08_2019_1200x627_final-e1567034017909-1024x317.png\" alt=\"In a multi-task learning environment, paired and unpaired data can be combined during training.\" width=\"1024\" height=\"317\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/ICECAPS_Social_08_2019_1200x627_final-e1567034017909-1024x317.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/ICECAPS_Social_08_2019_1200x627_final-e1567034017909-300x93.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/ICECAPS_Social_08_2019_1200x627_final-e1567034017909-768x238.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/ICECAPS_Social_08_2019_1200x627_final-e1567034017909.png 1201w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><p id=\"caption-attachment-606045\" class=\"wp-caption-text\">In a multi-task learning environment, paired and unpaired data can be combined during training.<\/p><\/div>\n<p>I<span style=\"font-variant: small-caps;\">cecaps<\/span> additionally implements <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/jointly-optimizing-diversity-and-relevance-in-neural-response-generation-2\/\">SpaceFusion<\/a>, a specialized multi-task learning paradigm originally designed to jointly optimize for diversity and relevance of generated responses. 
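As a toy illustration of how such a paradigm can shape a shared latent space, the extra loss can pull the two tasks' codes for the same example together while keeping codes for different examples spread out. This is a loose sketch of the idea only, not the actual SpaceFusion loss:

```python
# Toy illustration of SpaceFusion-style latent-space regularization:
# align the latent codes the two tasks produce for the same example,
# while keeping codes for different examples dispersed. A loose sketch
# of the idea, not the actual Icecaps implementation.

def dist(u, v):
    # Euclidean distance between two latent vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def fusion_loss(pairs):
    # pairs: list of (z_s2s, z_ae) latent vectors for the same example.
    # Minimize the distance between each example's two representations...
    align = sum(dist(a, b) for a, b in pairs) / len(pairs)
    # ...while rewarding dispersion between different examples, so the
    # shared space does not collapse to a point.
    zs = [a for a, _ in pairs]
    spread, n = 0.0, 0
    for i in range(len(zs)):
        for j in range(i + 1, len(zs)):
            spread += dist(zs[i], zs[j])
            n += 1
    return align - (spread / n if n else 0.0)

loss = fusion_loss([([0.0, 0.0], [1.0, 0.0]), ([3.0, 0.0], [3.0, 0.0])])
```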
SpaceFusion adds regularization terms to shape the latent space shared among tasks. These terms better align the distributions learned by each task over this latent space.<\/p>\n<div id=\"attachment_604776\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_2.png.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-604776\" class=\"wp-image-604776 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_2.png-1024x565.jpg\" alt=\"SpaceFusion adds regularization terms to a multi-task learning environment, imposing structure upon the shared latent space to improve efficiency.\" width=\"1024\" height=\"565\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_2.png-1024x565.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_2.png-300x165.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_2.png-768x424.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_2.png.jpg 1329w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><p id=\"caption-attachment-604776\" class=\"wp-caption-text\">SpaceFusion adds regularization terms to a multi-task learning environment, imposing structure upon the shared latent space to improve efficiency.<\/p><\/div>\n<h3>Personalization<\/h3>\n<p>To achieve personalization in conversational scenarios where an AI may be required to adopt some persona with its own particular style and attributes, I<span style=\"font-variant: small-caps;\">cecaps<\/span> allows researchers and developers to <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/persona-based-neural-conversation-model\/\">train multi-persona conversation systems on 
multi-speaker data using personality embeddings<\/a>. Personality embeddings work similarly to word embeddings; just as we learn an embedding for each word to describe how words relate to each other within a latent word space, we can learn an embedding per speaker from a multi-speaker dataset to describe a latent personality space. Multi-persona encoder-decoder models provide the decoder a personality embedding alongside word embeddings to condition the decoded response on the selected personality.<\/p>\n<div id=\"attachment_604782\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_3.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-604782\" class=\"wp-image-604782 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_3-1024x445.png\" alt=\"By combining a word embedding space with a persona embedding space, personalized sequence-to-sequence models enable personalized response generation.\" width=\"1024\" height=\"445\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_3-1024x445.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_3-300x130.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/icecaps_figure_3-768x334.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><p id=\"caption-attachment-604782\" class=\"wp-caption-text\">By combining a word embedding space with a persona embedding space, personalized sequence-to-sequence models enable personalized response generation.<\/p><\/div>\n<h3>MMI-based decoding<\/h3>\n<p>Conversational systems trained on noisy real-world data tend to produce nonspecific, bland responses such as \u201cI don\u2019t know what you\u2019re talking about.\u201d These systems learn this 
behavior as a safe way to consistently produce context-appropriate responses. The cost is response diversity and content. One method to tackle this issue is <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/a-diversity-promoting-objective-function-for-neural-conversation-models\/\">hypothesis reranking based on maximum mutual information (MMI)<\/a>. This approach trains a second model to predict the context given a potential response. This model assigns an additional score to each hypothesis generated by the base decoder, and this additional score is used to rerank the set of hypotheses. MMI takes the potential responses most targeted toward the given context and pushes them to the top of the list. I<span style=\"font-variant: small-caps;\">cecaps<\/span> incorporates MMI-based reranking, among several other decoding features, as part of its custom beam search decoder.<\/p>\n<h3>Knowledge grounding<\/h3>\n<p>One of the major bottlenecks in training conversational systems is a lack of conversational data that captures the richness of information present in the abundance of non-conversational data that exists in the world. We therefore need good tools that can take advantage of the latter. To train an intelligent agent endowed with all the knowledge contained within Wikipedia or other encyclopedic sources, for instance, I<span style=\"font-variant: small-caps;\">cecaps<\/span> implements <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/conversing-by-reading-contentful-neural-conversation-with-on-demand-machine-reading\/\">an approach to knowledge-grounded conversation that combines machine reading comprehension and response generation modules<\/a>. 
The model uses attention to isolate content from the knowledge source relevant to the context, allowing the model to produce more informed responses.<\/p>\n<div id=\"attachment_606015\" style=\"width: 910px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/Icecaps_Stand-Alone-Image_08_2019_2638x1489.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-606015\" class=\"wp-image-606015\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/Icecaps_Stand-Alone-Image_08_2019_2638x1489-1024x578.png\" alt=\"Cross-attention can be used to extract pertinent information from an external knowledge base for shaping generated responses.\" width=\"900\" height=\"508\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/Icecaps_Stand-Alone-Image_08_2019_2638x1489-1024x578.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/Icecaps_Stand-Alone-Image_08_2019_2638x1489-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/Icecaps_Stand-Alone-Image_08_2019_2638x1489-768x433.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/Icecaps_Stand-Alone-Image_08_2019_2638x1489-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/Icecaps_Stand-Alone-Image_08_2019_2638x1489-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/08\/Icecaps_Stand-Alone-Image_08_2019_2638x1489-640x360.png 640w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><p id=\"caption-attachment-606015\" class=\"wp-caption-text\">Cross-attention can be used to extract pertinent information from an external knowledge base for shaping generated responses.<\/p><\/div>\n<h3>Follow us!<\/h3>\n<p><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab 
glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/icecaps\">Follow our GitHub page<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>! You will receive updates as we add pre-trained systems, new natural language processing features, and tutorials. Informed personalized chatbots are only the beginning for conversational modeling; promising new areas of research include content filtering, multi-lingual modeling, and hybridizing conversational and task-oriented capabilities. We care about advancing the field of conversational modeling, and with I<span style=\"font-variant: small-caps;\">cecaps<\/span>, our goal is to empower researchers and developers to push the cutting edge.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How we act, including how we speak, is more often than not determined by the situation we find ourselves in. We wouldn\u2019t necessarily use the same tone and language with friends during a night out bowling as we would with colleagues during an office meeting. We tailor dialogue to appropriately fit the scenario. 
If trained [&hellip;]<\/p>\n","protected":false},"author":38022,"featured_media":606249,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Vighnesh Leonardo Shiv","user_id":"38166"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[243622],"tags":[],"research-area":[13545],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-604773","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-human-language-technologies","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[604608],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788-960x540.png\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788-768x432.png 768w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788-1280x720.png 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/ICECAPS_Site_08_2019_1400x788.png 1401w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Vighnesh Leonardo Shiv","formattedDate":"August 29, 2019","formattedExcerpt":"How we act, including how we speak, is more often than not determined by the situation we find ourselves in. 
We wouldn\u2019t necessarily use the same tone and language with friends during a night out bowling as we would with colleagues during an office meeting.&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/604773","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38022"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=604773"}],"version-history":[{"count":19,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/604773\/revisions"}],"predecessor-version":[{"id":606177,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/604773\/revisions\/606177"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/606249"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=604773"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=604773"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=604773"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=604773"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=604773"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp
\/v2\/msr-event-type?post=604773"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=604773"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=604773"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=604773"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=604773"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=604773"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}