Microsoft Research Blog

ChatPainter: Improving text-to-image generation by using dialogue

April 23, 2018 | By Microsoft blog editor

Generating realistic images from a text description is a challenging task for a bot. A solution to this task has potential applications in the video game and image editing industries, among many others. Recently, researchers at Microsoft and elsewhere have been exploring ways to enable bots to draw realistic images in defined domains, such as birds, faces or furniture. However, because only a limited amount of annotated paired image-caption data is available, models have difficulty learning how the words in a caption correspond to objects and their interactions. In this new area of research, we explore using dialogue to generate images from text that references several objects, such as “A fire truck stopped in the middle of a quiet street while people pass by on the sidewalk.”

A team of researchers from the Montreal Institute for Learning Algorithms (MILA) at the University of Montreal and Microsoft Research Montreal (MSR Montreal) took inspiration from how sketch artists draw while conversing with a person who is describing a scene. They hypothesized that giving the bot feedback in the form of a dialogue, in addition to the text, would help the generation process. For example, the feedback could discuss details about the objects in the caption or even objects not present in the caption.

They tested this hypothesis by pairing images and captions from the Microsoft COCO dataset [1] with dialogues for these same images from the Visual Dialog dataset [2]. The dialogues in the Visual Dialog dataset were collected by pairing people. The person playing the role of an ‘answerer’ had access to the image and its caption and had to answer questions about the image. The person playing the role of the ‘questioner’ had access only to the image’s caption. The questioner had to ask questions to be able to imagine the scene more clearly.
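The pairing step described above can be sketched as a join on a shared image identifier. This is a minimal illustration, not the team's actual preprocessing code: the in-memory dictionaries, the `PairedExample` type, and the `Q:`/`A:` flattening format are all assumptions made for the example, and the real datasets ship in their own file formats.

```python
from dataclasses import dataclass


@dataclass
class PairedExample:
    """One training example: a caption plus a dialogue about the same image."""
    image_id: int
    caption: str
    dialogue: list  # list of (question, answer) tuples


def pair_captions_with_dialogues(captions, dialogues):
    """Join captions and dialogues on image ID.

    captions:  dict mapping image_id -> caption string (hypothetical layout)
    dialogues: dict mapping image_id -> list of (question, answer) tuples
    Only images present in both collections are kept.
    """
    shared_ids = sorted(captions.keys() & dialogues.keys())
    return [
        PairedExample(image_id, captions[image_id], dialogues[image_id])
        for image_id in shared_ids
    ]


def flatten_dialogue(dialogue):
    """Turn question-answer turns into one string a text encoder could read."""
    return " ".join(f"Q: {q} A: {a}" for q, a in dialogue)


# Toy data standing in for the two datasets.
captions = {
    1: "A fire truck stopped in the middle of a quiet street",
    2: "A bird sitting on a branch",
}
dialogues = {
    1: [("Is it daytime?", "Yes"), ("Are there people?", "Yes, on the sidewalk")],
    3: [("Is the room tidy?", "No")],
}

for example in pair_captions_with_dialogues(captions, dialogues):
    print(example.image_id, example.caption, "|", flatten_dialogue(example.dialogue))
```

Images that have a caption but no dialogue (or the reverse) are dropped here, which is one simple policy among several possible ones.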

The Visual Dialog dataset serves as an approximation of the sketch artist scenario. With it, the team observed that conditioning on dialogues helped existing models, such as StackGAN [3], generate higher-quality images than the same model architecture conditioned only on captions.
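To make "conditioning on dialogues" concrete, the sketch below builds the vector a generator would be conditioned on, once from the caption alone and once from the caption plus the dialogue. It is an illustration under stated assumptions, not the ChatPainter architecture: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a learned text encoder, and concatenation is just one common way to combine two embeddings.

```python
import zlib


def toy_embed(text, dim=8):
    """Hypothetical stand-in for a learned text encoder.

    Hashes each token into one of `dim` buckets and averages the counts.
    crc32 is used so the result is the same on every run.
    """
    vec = [0.0] * dim
    tokens = text.lower().split()
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    n = len(tokens) or 1
    return [v / n for v in vec]


def conditioning_vector(caption, dialogue_text=None, dim=8):
    """Build the vector a conditional generator would receive.

    Caption-only conditioning yields `dim` values; adding a dialogue
    appends a second embedding, giving 2 * `dim` values.
    """
    cond = toy_embed(caption, dim)
    if dialogue_text is not None:
        cond = cond + toy_embed(dialogue_text, dim)  # list concatenation
    return cond


caption = "A fire truck stopped in the middle of a quiet street"
dialogue = "Q: Is it daytime? A: Yes Q: Are there people? A: Yes, on the sidewalk"

caption_only = conditioning_vector(caption)
caption_and_dialogue = conditioning_vector(caption, dialogue)
print(len(caption_only), len(caption_and_dialogue))
```

In a real model the generator would receive this vector alongside a noise vector, and the dialogue half is where details absent from the caption can reach the generator.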

Some images drawn by the ChatPainter model when given a caption and a dialogue.

Image generated by the ChatPainter model for a given caption and dialogue

While there is still a long way to go before models can generate realistic images of such complexity, this research represents a significant improvement over previous approaches. The team from MSR Montreal believes that in the near future, it will be possible to have conversations with a bot that can generate an image someone has in mind and iteratively refine it based on feedback received in the dialogue. This could be useful in animation, interior design, painting and photo refinement, among other areas.

The team of researchers consists of Shikhar Sharma and Samira Ebrahimi Kahou from MSR Montreal, and Dendi Suhubdy, Vincent Michalski and Yoshua Bengio from MILA.

Read the research paper describing the ChatPainter model. To the best of our knowledge, this is the first public research paper to generate images from dialogue data.


[1] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., … & Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, Cham, 2014.

[2] Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., … & Batra, D. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Volume 2. 2017.

[3] Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., & Metaxas, D. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision. 2017.


