{"id":480744,"date":"2018-04-23T11:47:49","date_gmt":"2018-04-23T18:47:49","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=480744"},"modified":"2018-05-07T13:47:05","modified_gmt":"2018-05-07T20:47:05","slug":"chatpainter-improving-text-image-generation-using-dialogue","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/chatpainter-improving-text-image-generation-using-dialogue\/","title":{"rendered":"ChatPainter: Improving text-to-image generation by using dialogue"},"content":{"rendered":"<p>Generating realistic images from a text description is a challenging task for a bot. A solution to this task has potential applications in the video game and image editing industries, among many others. Recently, researchers at Microsoft and elsewhere have been exploring ways to <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/scirate.com\/arxiv\/1802.06454\">enable bots to draw<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> realistic images in defined domains, such as birds, faces or furniture. However, because there is a limited amount of annotated paired image-caption data available, models have difficulty learning how the words in a caption correspond to objects and their interactions. 
In this new area of research,\u00a0we explore ways to use dialogue to generate images from text that references several objects, such as \u201cA fire truck stopped in the middle of a quiet street while people pass by on the sidewalk.\u201d<\/p>\n<p>A team of researchers from the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/mila.quebec\/en\/\">Montreal Institute for Learning Algorithms (MILA)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> at the University of Montreal and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/lab\/microsoft-research-montreal\/\">Microsoft Research Montreal (MSR Montreal)<\/a> took inspiration from how sketch artists draw a sketch while conversing with a person who is describing a scene. They hypothesized that giving the bot feedback in the form of a dialogue, in addition to the text, would help the generation process. For example, the feedback could discuss details about the objects in the caption or even objects not present in the caption.<\/p>\n<p>They tested this hypothesis by pairing images and captions from the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/cocodataset.org\/\">Microsoft COCO dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> [1] with dialogues for these same images from the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/visualdialog.org\/\">Visual Dialog dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> [2]. The dialogues in the Visual Dialog dataset were collected by pairing people. The person playing the role of the \u2018<em>answerer<\/em>\u2019 had access to the image and its caption and had to answer questions about the image. 
The person playing the role of the \u2018<em>questioner<\/em>\u2019 had access only to the image\u2019s caption. The questioner had to ask questions to be able to imagine the scene more clearly.<\/p>\n<p>Using the Visual Dialog dataset as an approximation to the sketch artist scenario, the team tested their hypothesis. They observed that conditioning on dialogues helped existing models, such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1612.03242\">StackGAN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> [3], to generate higher quality images than the same model architecture conditioned only on captions.<\/p>\n<div id=\"attachment_484968\" style=\"width: 1034px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-484968\" class=\"wp-image-484968 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/ImageGrid.jpg\" alt=\"Some images drawn by the ChatPainter model when given a caption and a dialogue.\" width=\"1024\" height=\"470\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/ImageGrid.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/ImageGrid-300x138.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/ImageGrid-768x353.jpg 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-484968\" class=\"wp-caption-text\">Some images drawn by the ChatPainter model when given a caption and a dialogue<\/p><\/div>\n<div id=\"attachment_484965\" style=\"width: 1034px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-484965\" class=\"wp-image-484965 size-full\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/SurfBoard-questions.png\" alt=\"Image generated by the ChatPainter model for a given caption and dialogue\" width=\"1024\" height=\"333\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/SurfBoard-questions.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/SurfBoard-questions-300x98.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/SurfBoard-questions-768x250.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-484965\" class=\"wp-caption-text\">Image generated by the ChatPainter model for a given caption and dialogue<\/p><\/div>\n<p>While there is still a long way to go before models can generate realistic images of such complexity, this research represents a significant improvement over previous approaches. The team from MSR Montreal believes that in the near future, it will be possible to have conversations with a bot that can generate an image someone has in mind and iteratively refine it from feedback received in the dialogue.\u00a0This could be useful in animation, interior design, painting and photo refinement, among other areas.<\/p>\n<p>The team of researchers consists of <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/shsh\/\">Shikhar Sharma<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/saebrahi\/\">Samira Ebrahimi Kahou<\/a> from MSR Montreal, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/suhubdy.com\/research\/index.php\/about-me\/\">Dendi Suhubdy<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" 
href=\"http:\/\/www-etud.iro.umontreal.ca\/~michals\/\">Vincent Michalski<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.iro.umontreal.ca\/~bengioy\/yoshua_en\/index.html\">Yoshua Bengio<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> from MILA.<\/p>\n<p>Read the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1802.08216\">research paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> describing the ChatPainter model. To the best of our knowledge, this is the first public research paper to generate images from dialogue data.<\/p>\n<p><strong>References<\/strong><\/p>\n<p>[1] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., &#8230; & Zitnick, C. L. <em>Microsoft COCO: Common objects in context.<\/em> In European Conference on Computer Vision. Springer, Cham, 2014.<\/p>\n<p>[2] Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., &#8230; & Batra, D. <em>Visual dialog.<\/em> In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Volume 2. 2017.<\/p>\n<p>[3] Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., & Metaxas, D. <em>StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks.<\/em> IEEE Int. Conf. Comput. Vision. 
2017.<\/p>\n<p><strong>Related<\/strong><\/p>\n<ul>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/blogs.microsoft.com\/ai\/drawing-ai\/\">Microsoft researchers build a bot that draws what you tell it to<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/scirate.com\/arxiv\/1802.06454\">DA-GAN: Instance-level Image Translation by Deep Attention Generative Adversarial Networks (with Supplementary Materials)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Generating realistic images from a text description is a challenging task for a bot. A solution to this task has potential applications in the video game and image editing industries, among many others. 
Recently, researchers at Microsoft and elsewhere have been exploring ways to enable bots to draw realistic images in defined domains, such as [&hellip;]<\/p>\n","protected":false},"author":37074,"featured_media":481878,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[],"msr_hide_image_in_river":0,"footnotes":""},"categories":[241770,194480,194481],"tags":[],"research-area":[13556,13562,13551],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-480744","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence","category-graphics-and-multimedia","category-human-centered-computing","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-research-area-graphics-and-multimedia","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[437514],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"480\" height=\"280\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/04\/ChatPainter_AI_Carosel_04_2018_480x280.jpg\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/04\/ChatPainter_AI_Carosel_04_2018_480x280.jpg 480w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/04\/ChatPainter_AI_Carosel_04_2018_480x280-300x175.jpg 300w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/>","byline":"","formattedDate":"April 23, 2018","formattedExcerpt":"Generating realistic images from a text description is a challenging task for a bot. A solution to this task has potential applications in the video game and image editing industries, among many others. Recently, researchers at Microsoft and elsewhere have been exploring ways to enable&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/480744","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37074"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=480744"}],"version-history":[{"count":12,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/480744\/revisions"}],"predecessor-version":[{"id":484980,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/480744\/revisions\/484980"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/481878"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=480744"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=480744"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=480744"},{"taxonomy":"m
sr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=480744"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=480744"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=480744"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=480744"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=480744"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=480744"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=480744"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=480744"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}