Return to Blog Home
Microsoft Research Blog

Getting into a conversational groove: New approach encourages risk-taking in data-driven neural modeling


Microsoft Research’s Natural Language Processing group has set an ambitious goal for itself: to create a neural model that can engage in the full scope of conversational capabilities, providing answers to requests while also bringing the value of additional information relevant to the exchange and—in doing so—sustaining and encouraging further conversation.

Take the act of renting a car at the airport, for example. Across from you at the counter is the company representative, entering your information into the system, checking your driver’s license, and the like. If you’re lucky, the interaction isn’t merely a robotic back-and-forth; there is a social element that makes the mundane experience more enjoyable.

Spotlight: Academic programs

Working with the academic community

Read more about grants, fellowships, events and other ways to connect with Microsoft research.

“They might ask you where you’re going, and, you say the Grand Canyon. As they’re typing, they’re saying, ‘The weather’s beautiful out there today; it looks gorgeous,’” explained Microsoft Principal Researcher and Research Manager Bill Dolan. “We’re aiming for that kind of interaction, where pleasantries that are linked to the context, even if it’s a very task-oriented context, are not just appropriate, but in many situations, making the conversation feel fluid and human.”

As is the case with many goals worth pursuing, there are obstacles. Existing end-to-end data-driven neural networks have proven highly effective in generating conversational responses that are coherent and relevant, and Microsoft has been at the forefront of the rapid progress that has been made, the first to publish in the space of data-driven approaches to modeling conversational responses back in 2010. But these neural models present two particularly large challenges: They tend to produce very bland, vague outputs—hallmarks of stale conversation and nonstarters if the goal is user engagement beyond the completion of singular tasks—and they take a top-level either-or approach, classifying inputs as either task-oriented or conversational and assigning to each a specific path in the code base that fails to account for the nuances of the other. The result? Responses to more sophisticated conversation that can often be uninformative if varied—for example, “I haven’t a clue” and “I couldn’t tell you”—or they may be informative but not specific enough—such as “I like music” versus “I like jazz”—a result of traditional generation strategies that try to maximize the likelihood of the response.

The paper the team is presenting at the 2018 Conference on Neural Information Processing Systems (NeurIPS)—“Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization”—tackles the former challenge, introducing a new approach to producing more engaging responses that was inspired by the success of adversarial training techniques in such areas as image generation.

“Ideally, we would like to have the systems generate informative responses that are relevant and fully address the input query,” said leading author Yizhe Zhang. “By the same token, we also would like to promote responses that are more varied and less conventionally predictable, something that would help make conversations seem more natural and humanlike.”

“This work is focused on trying to force these modeling techniques to innovate more and not be so boring, to not be the person you’re desperately trying to avoid at the party,” added Dolan.

The force of two major algorithmic components

To accomplish this, the team determined it needed to generate responses that reduce the uncertainty of the query. In other words, the system needed to be better able to guess from the response what the original query might have been, reducing the chance that the system would produce bland outputs such as “I don’t know.”

In the paper, Zhang, Dolan, and their collaborators introduce adversarial information maximization (AIM). Designed to train end-to-end neural response generation models that produce conversational responses that are both informative and diverse, this new approach combines two major algorithmic components: generative adversarial networks (GANs) to encourage diversity and variational information maximization objective (VIMO) to produce informative responses.

“This adversarial training technique has received great success in generating very diverse and realistic-looking synthetic data when it comes to image creation,” said Zhang, who began this work as a Microsoft Research intern while at Duke University and is now a researcher with the company. “It’s been less explored in the text domain because of the discrete nature of text, and we were inspired to see how it could help with natural language processing, especially in dialogue generation.”

GANs themselves are increasingly deployed in neural response and commonly use synthetic data during training. Equilibrium for the GAN objective is achieved when the synthetic data distribution matches the real data distribution. This has the effect of discouraging the generation of responses that demonstrate less variation than human responses. While this may help reduce the level of blandness, however, the GAN technique was not developed for the purpose of explicitly improving either informativeness or diversity. That is where VIMO comes in.

Going backward to move forward

The team trained a backward model that generates the query, or source, from the response, or target. The backward model is then used to guide the forward model—from query to response—to generate relevant responses during training, providing a principled approach to mutual information maximization. This work is the first application of a variational mutual information objective in text generation.

The authors also employed a dual adversarial objective that composes both source-to-target and target-to-source objectives. The dual objective requires the forward and backward model to work synergistically, and each improves the other.

To mitigate the well-known instability in training GAN models, the authors—inspired by the deep structured similarity model—applied an embedding-based discriminator rather than the binary classifier that is conventionally used in GAN training. To reduce the variance of gradient estimation, they used a deterministic policy gradient algorithm with a discrete approximation strategy.

The paper advances the team’s focus on improving ranking candidate hypotheses to push the system to take more risks and produce more interesting outputs.

“In ranking the candidate hypotheses, you might have hundreds and thousands of hypotheses that it’s trying to weigh, and the very top-ranked ones might be these really bland-type ones,” explained Dolan. “If you look down at candidate No. 2,043, it might have a lot of content words, but be wrong and completely odd in context even though it’s aggressively contentful. Go down a little farther, and maybe you find a candidate that’s contentful and appropriate in context.”

Persona non grata

Solving the fundamental problem of uninteresting and potentially uninformative outputs in today’s modeling techniques is an important pursuit, as it’s a significant obstacle in creating conversational agents that individuals will want to engage with regularly in their everyday lives. Without interesting and useful outputs, conversations, task-oriented or not, will quickly spiral into the trivial unless the user is continuously voicing keywords. In that way, current neural models are very reactive, requiring a lot of work from the user, and that can be frustrating and exhausting.

“It’s not that tempting to engage with these agents even though they sound, superficially, fluent as if they understand you, because they tend not to innovate in the conversation,” said Dolan.

Conversation generation stands to gain a lot from this work, but so do other tasks involving language and neural models, such as video and photo captioning or text summarization, let’s say of a spreadsheet you’re working in.

“You don’t want a generated spreadsheet caption that is just, ‘Lines are going up. Numbers are all over the place,’” said Dolan. “You actually need it to be contentful and tie to the context in interesting ways, and that’s at odds with the tendency of current neural modeling techniques.”

The team can envision a future in which exchanges with conversational agents are comparable to those with friends, an exploratory process in which you’re asking for an opinion, unsure of where the conversation will lead.

“You can use our system to improve that, to produce more engaging and interesting dialogue; that’s what this is all about,” said Zhang.