GODEL: Combining goal-oriented dialog with real-world conversations


By , Senior Researcher , Senior Principal Researcher , Principal Research Software Engineer Manager , Principal Researcher , Assistant Professor , Distinguished Scientist & Vice President

Diagram showing GODEL’s architecture. The environment of the dialog system consists of both structured and unstructured content, which it uses to retrieve information. This source content, which we term “grounding,” is updated and repeatedly used by GODEL to produce a new response after each user input.

They make restaurant recommendations, help us pay bills, and remind us of appointments. Many people have come to rely on virtual assistants and chatbots to perform a wide range of routine tasks. But what if a single dialog agent, the technology behind these language-based apps, could perform all these tasks and then take the conversation further? In addition to providing on-topic expertise, such as recommending a restaurant, it could engage in a conversation about the history of the neighborhood or a recent sports game, and then bring the conversation back on track. What if the agent’s responses continually reflect the latest world events? And what if it could do all of this without the need for any additional work by the designer?   

With GODEL (opens in new tab), this may not be far off. GODEL stands for Grounded Open Dialogue Language Model, and it ushers in a new class of pretrained language models that enable both task-oriented and social conversation and are evaluated by the usefulness of their responses.  

Pretrained language models are among the engines that power conversational AI, the technology that underlies these dialog agents. They can either be task-oriented (“give me a job, and I’ll do it”) or engage in a conversation without a specified outcome, known as open-domain or chit-chat. GODEL combines both these capabilities, giving dialog agents the ability to generate responses based not just on the context of the conversation, but also on external information, content that was not part of the dataset when the model was trained. This includes both structured content, such as information stored in databases, and unstructured content, such as restaurant reviews, Wikipedia articles, and other publicly available material found on the web. This explains how a simple task-based query about restaurant recommendations can evolve into a dialog about ingredients, food, and even cooking techniques—the kind of winding path that real-world conversations take.  

In 2019, the Deep Learning and Natural Language Processing groups at Microsoft Research released DialoGPT, the first large-scale pretrained language model designed specifically for dialog. This helped make conversational AI more accessible and easier to work with, and it enabled the research community to make considerable progress in this area. With GODEL, our goal is to help further this progress by empowering researchers and developers to create dialog agents that are unrestricted in the types of queries they can respond to and the sources of information they can draw from. We also worked to ensure those responses are useful to the person making the query.    

In our paper, “GODEL: Large-Scale Pre-training for Goal-Directed Dialog (opens in new tab),” we describe the technical details underlying GODEL, and we have made the code available on GitHub (opens in new tab)

A grounded model

One of GODEL’s key features is the flexibility it provides users in defining their model’s grounding—the sources from which their dialog agents retrieve information. This flexibility informs GODEL’s versatility in diverse conversational settings. If someone were to inquire about a local restaurant for example, GODEL would be able to provide specific and accurate responses even though that venue may not have been included in the data used to train it. Responses would vary depending on whether the grounding information is empty, a snippet of a document, a search result (unstructured text), or information drawn from a database about the restaurant (structured text). However, each response would be appropriate and useful. 

In addition to specificity, grounded generation helps keep models up to date, as the grounded text can incorporate information that may not have been available at the time the model was trained. For example, if a model were developed before the 2022 Winter Olympics, GODEL would be able to provide details on those games and a list of winners even though all the data available to train it predates that event.

Spotlight: Event Series

Microsoft Research Forum

Join us for a continuous exchange of ideas about research in the era of general AI. Watch the first three episodes on demand.

Broad application of GODEL

Another main feature of GODEL is its wide range of dialog applications. While its predecessor, DialoGPT, and other prior pretrained models for dialog have mostly focused on social bots, GODEL can be applied to a variety of dialogs, including those that are task-oriented, question-answering, and grounded chit-chat. In the same conversation, GODEL can produce reasonable responses for a variety of query types, including general questions or requests for specific actions.  

In addition, GODEL’s responses have been evaluated for their helpfulness. In our paper, we show that evaluation is done more reliably on datasets that are goal-directed, and that people generally agree on which responses are better when asked to judge their utility towards achieving certain goals. Equipped with this robust evaluation setup, we compared our model against several strong baselines and state-of-the-art approaches and show that GODEL is superior in terms of both human and automatic evaluation, as indicated in Figure 1. The paper describes extensive experiments against other state-of-the-art pretrained language models and demonstrates that performance gains are even larger in these cases. 

Two bar graphs showing that GODEL outperforms the baseline, in terms of both human and automated dialog evaluation. For human evaluation, GODEL received much higher human ratings (47, 41, and 27), while the human ratings for the best baseline were low (30, 22, and 17). For automatic evaluation, differences are smaller yet still statistically significant.
Figure 1: These charts illustrate GODEL’s performance against T5, a pretrained model that performed best in our evaluation. They compare the aggregate performance of models fine-tuned from GODEL against that of models fine-tuned from T5. They show that GODEL performs much better in human evaluations and makes appreciable gains in the automatic evaluation. The test set for these experiments combines a variety of dialog genres, including task-oriented dialog, conversational question-answering, and grounded chit-chat.

The following examples illustrate different dialog scenarios where GODEL uses a variety of sources to respond to identical user queries. 

  • This example illustrates how GODEL responds in an open-ended scenario in which the user asks a question that is completely unrelated to the initial question. Despite the lack of relevance, GODEL responds appropriately while trying to bring the conversation back on track. 

    Figure showing how GODEL responds to a user who just changed the topic, demonstrating that it can bring the conversation back on track. While the initial query is about a restaurant, the user suddenly mentions a series of tornadoes that have recently affected the area. GODEL uses grounding from a recent news article to provide information about the tornadoes, as requested by the user. Finally, it asks the user if there is anything else it can help with.
  • This example illustrates how GODEL responds in a task-oriented setting in which the model is connected to the components of a traditional goal-oriented dialog systems, such as a database. In this case, the relevant environment contains structured information, a database returning two restaurants relevant to the current conversation.  

    Figure showing how GODEL responds appropriately to a user's request for a restaurant reservation. The user expresses a preference for a restaurant named Lucky Star, and GODEL extracts information from a database about that restaurant and retrieves relevant information, such as a reference number, to generate a response that flows naturally with the rest of the conversation.
  • This example illustrates how GODEL responds in a task-oriented setting in which traditional components of task-oriented dialog systems are not available. In this case, GODEL retrieves a restaurant review via a search engine. The response reflects both the context of the conversation and a snippet of the retrieved text, a restaurant review.  

    Figure showing how GODEL responds appropriately to a user's request for information about a specific restaurant. The user asks whether a given restaurant is good for groups, and GODEL uses text originating from restaurant reviews to infer that the restaurant is indeed good for groups. Also, GODEL provides additional information to address a concern with larger groups—that food is typically served quickly.
  •  This example illustrates how GODEL responds in a question-answering scenario, where the user asks a general question and the context provides the dialog agent with the words it needs to search for the relevant information on the web. 

    Figure showing how GODEL responds appropriately when asked to give an example of a popular Chinese dish. GODEL uses grounding originating from search results to respond to the question while focusing on the most relevant information of the retrieved document.

GODEL available as open source

To advance research, we believe it is crucial to make code and models publicly available, and we have released GODEL as fully open source (opens in new tab). We have made three versions of GODEL available: base, large, and extra-large. We are also including the code needed to retrain all pretrained models and to fine-tune models for specific tasks: the CoQA dataset, intended for conversational question-answering; the Wizard of Wikipedia and Wizard of the Internet datasets, aimed at information-seeking chats; and MultiWOZ is for task-completion dialogs.

We hope GODEL helps numerous academic research teams advance the field of conversational AI with innovative dialog models while eliminating the need for significant GPU resources. We plan to continuously improve GODEL and make more models available to the research community. Please visit our project page to learn more about the GODEL project and new releases.


We would like to thank our fellow colleagues at Microsoft Research who contributed to this work and blog post: Bill Dolan, Pengcheng He, Elnaz Nouri, Clarisse Simoes Ribeiro. 

Related publications

Continue reading

See all blog posts