
Designing a Framework for Conversational Interfaces

This is a guest post from our close partners, Semantic Machines

By Zachary Tellman

Conversational interfaces are an idea that is forever on the cusp of transforming the world. The potential is undeniable: everyone has innate, untapped conversational expertise. We could do away with the nested menus required by visual interfaces; anything the user can name is immediately at hand. We could turn natural language into a declarative scripting language, and operating systems into an IDE.

Reality, however, has not lived up to this potential. Most people’s use of the conversational agents in their phones and smart devices is limited to reminders and timers, if they use them at all. At Semantic Machines, we are creating a framework for conversational interfaces that we hope will unlock some of this potential. It’s currently powering a conversational interface in Outlook Mobile, with other products soon to follow.

To accomplish this, our framework combines some of the latest advances in machine learning with concepts and approaches dating back to the earliest days of artificial intelligence research. To understand why, we’ll first need to look back fifty years to one of the first — and still one of the most successful — conversational agents ever created.

A Detour Through History

In 1972, Terry Winograd published a paper entitled Understanding Natural Language which described a software project he had worked on in the late 1960s. It allowed users to direct a virtual robot arm, named SHRDLU, to interact with a world consisting of a table, a box, and a few blocks of varying shapes and colors. Users could carry on a conversation with SHRDLU, asking questions and giving instructions:

– Pick up a big red block.
OK.

– Find a block which is taller than the one you are holding and put it into the box.
BY “IT”, I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
OK.

– What does the box contain?
THE BLUE PYRAMID AND THE BLUE BLOCK.

– How many blocks are not in the box?
FOUR OF THEM.

– Is at least one of them narrower than the one which I told you to pick up?
YES, THE RED CUBE.

– Is it supported?
YES, BY THE TABLE.

– Can the table pick up blocks?
NO.

– Can a pyramid be supported by a block?
YES.

– Can a pyramid support a pyramid?
I DON’T KNOW.

– Stack up two pyramids.
(trying) I CAN’T.

Winograd’s project represents a pivotal point in the history of AI research. Earlier efforts were significantly more ambitious; Herbert Simon and Allen Newell’s “General Problem Solver,” introduced in 1958, was presented not just as a method for achieving human-like behavior, but also as a descriptive model for human cognition. As became the norm for early AI research, Simon and Newell reduced the problem to one of search. Given an initial state and a desired end state, the Solver would search through all possible sequences of actions until it found one that led to that end state. Since the branching factor of the search tree would be very high (you can, in most situations, do almost anything), the Solver would need to use heuristics (from the Greek heureka, as in “I’ve found it!”) to determine which actions were likely to be useful in a given situation.

Having described the engine for thought, all that remained was “knowledge engineering”: creating a repository of possible actions and relevant heuristics for all aspects of human life. This, unfortunately, proved harder than expected. As various knowledge engineering projects stalled, researchers focused on problem solving within “microworlds”: virtual environments where the state was easily represented, and the possible actions easily enumerated. Winograd’s microworld was the greatest ever created; SHRDLU’s mastery of its environment, and the subset of the English language that could be used to describe it, was self-evident.

Still, it wasn’t clear how to turn a microworld into something more useful; the boundaries of SHRDLU’s environment were relied upon at every level of its implementation. Hubert Dreyfus, a professor of philosophy and leading critic of early AI research, characterized these projects as “ad hoc solutions [for] cleverly chosen problems, which give the illusion of complex intellectual activity.” Ultimately, Dreyfus was proven right; every attempt to generalize or stitch together these projects failed.

What came next is a familiar story: funding for research dried up in the mid-1970s, marking the beginning of the AI Winter. After some failed attempts in the 1980s to commercialize past research by selling so-called “expert systems,” the field lay dormant for decades before the resurgence of the statistical techniques generally referred to as “machine learning.”

Generally, this era in AI research is seen as a historical curiosity; a group of researchers made wildly optimistic predictions about what they could achieve and failed. What could they possibly have to teach us? Surely it’s better to look forward to the bleeding edge of research than back at these abandoned microworlds.

We must acknowledge, however, the astonishing sophistication of Winograd’s SHRDLU when compared to modern conversational agents. These agents operate on a model called “slots and intents,” which is effectively Mad-Libs in reverse. Given some text from the user (the utterance), the system identifies the corresponding template (the intent), and then extracts pieces of the utterance (the slots). These pieces are then fed into a function which performs the task associated with the intent.

If, for example, we had a function order_pizza(size, toppings), a slots-and-intents framework can easily provide a mapping between “order me a medium pizza with pepperoni and mushrooms” and order_pizza("medium", ["pepperoni", "mushrooms"]). It allows us to separate linguistic concerns from the actual business logic required to order a pizza. But consider the second utterance from the conversation with SHRDLU:

Find a block which is taller than the one you are holding and put it into the box.

This utterance is difficult to model as an intent for a number of reasons. It describes two actions, but since every intent maps onto a single function, we’d have to define a compound function find_block_and_put_into_box(...) and define similar functions for any other compound action we’d want to support. But even that’s not enough; if we simply call find_block_and_put_into_box("taller than the one you are holding"), we’re letting linguistic concerns bleed into the business logic. At most, we’d want the business logic to be interpreting individual words like “taller,” “narrower,” and so on, but that would require an even more specific function:

find_block_which_is_X_than_held_block_and_put_in_box("taller")

The problem is that natural language is compositional, while slots-and-intents frameworks are not. Rather than defining a set of primitives (“find a block,” “taller than,” “held block,” etc.) that can be freely combined, the developer must enumerate each configuration of these primitives they wish to support. In practice, this leads to conversational agents that are narrowly focused and easily confused.
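To make the limitation concrete, here is a minimal sketch of a slots-and-intents matcher in Python. It is only an illustration: the regular-expression templates, the intent names, and the helper split_toppings are assumptions made for this example, and production frameworks typically use learned classifiers rather than regular expressions.

```python
import re

# Hypothetical intent templates: each pairs one pattern with one handler name.
INTENTS = [
    (re.compile(r"order me an? (?P<size>\w+) pizza with (?P<toppings>.+)"), "order_pizza"),
    (re.compile(r"pick up an? (?P<size>\w+) (?P<color>\w+) block"), "pick_up_block"),
]


def parse(utterance):
    """Return (intent_name, slots) for the first matching template, else None."""
    text = utterance.lower().rstrip(".")
    for pattern, intent in INTENTS:
        match = pattern.fullmatch(text)
        if match:
            return intent, match.groupdict()
    return None


def split_toppings(raw):
    """Split a toppings slot such as 'pepperoni and mushrooms' into a list."""
    return [t for t in re.split(r",\s*|\s+and\s+", raw) if t]
```

The pizza utterance and “Pick up a big red block” each match a template, but the compound SHRDLU utterance matches none of them. Supporting it would mean adding yet another template and another handler for that exact combination of primitives.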

Winograd’s SHRDLU, despite its limitations, was far more flexible. At Semantic Machines we are building a dialogue system that will preserve that flexibility, while avoiding most of the limitations. This post will explain, at a high level, how we’ve accomplished that feat. If you find this problem space or our approach interesting, you should consider working with us.

Plans

In our dialogue system, utterances are translated into small programs, which for historical reasons are called plans. Given the problematic utterance:

Find a block which is taller than the one you are holding and put it into the box.

Our planning model, which is a Transformer-based encoder-decoder neural network, will return something like this:

find_block((b: Block) => taller_than(b, held_block()))
put_in_box(the[Block]())

This is rendered in Express, an in-house language which is syntactically modeled after Scala. Notice that each symbol in the plan corresponds almost one-to-one with a part of the utterance, down to a special the() function which resolves what “it” refers to. This is because we only want the planning model to translate the utterance, not interpret it.

The reason for this isn’t immediately obvious; to most experienced developers, a function like taller_than would seem like an unnecessary layer of indirection. Why not just inline it?

find_block((b: Block) => b.height > held_block().height)

This indirection, however, is valuable. In a normal codebase, function names aren’t exposed; we can assign them any meaning we like, so long as it makes sense to other people on our team. These functions, by contrast, are an interface between our system and the user, and so their meaning is defined by the user’s intent. Over time, that meaning is almost certain to become more nuanced. We may, for instance, realize that when people say “taller than,” they mean noticeably taller:

def taller_than(a: Block, b: Block) = (a.height - b.height) > HEIGHT_EPSILON

If we’ve maintained our layer of indirection, this is an easy one-line change to our function definition, and the training dataset for the planning model remains unchanged. If we’ve inlined the function, however, we have to carefully migrate our training dataset; we only want to update a.height > b.height where it corresponds to “taller than” in the utterance.
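The effect of this indirection can be sketched in a few lines of Python. This is not Express, and the block names, heights, and the value of HEIGHT_EPSILON are made up for the example; the point is only that the plan stays fixed while the definition of taller_than changes underneath it.

```python
from dataclasses import dataclass

HEIGHT_EPSILON = 0.5  # assumed threshold for "noticeably taller"


@dataclass(frozen=True)
class Block:
    name: str
    height: float


# Hypothetical world state for the example.
BLOCKS = [Block("red cube", 2.0), Block("green cube", 2.1), Block("blue block", 4.0)]
HELD = BLOCKS[0]


def held_block():
    return HELD


def taller_than(a, b):
    """The meaning of 'taller than' lives here, not in the plan."""
    return (a.height - b.height) > HEIGHT_EPSILON


def find_block(predicate):
    """Return the first block satisfying the predicate, or None."""
    return next((b for b in BLOCKS if predicate(b)), None)


def plan():
    """The plan as translated from the utterance; unchanged when semantics change."""
    return find_block(lambda b: taller_than(b, held_block()))


def inlined_plan():
    """The same plan with the comparison inlined, freezing the old semantics."""
    return find_block(lambda b: b.height > held_block().height)
```

With these values, plan() skips the green cube, which is only marginally taller than the held block, and returns the blue block. inlined_plan() still returns the green cube, and fixing it would mean editing every stored plan that contains the inlined comparison.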

By focusing on translation, we keep our training data timeless, allowing our dataset to monotonically grow even as we tinker with semantics. By matching each natural language concept to a function, we keep our semantics explicit and consistent. This approach, however, assumes the meaning is largely context-independent. Our planning model is constrained by the language’s type system, so if the utterance doesn’t mention blocks it won’t use block-related functions, but otherwise we assume that “taller than” can always be translated into taller_than.

This, of course, is untrue for pronouns like “it,” “that,” or “them”; their meaning depends entirely on what was said earlier in the conversation. In our system, all such references are translated into a call to the(). This is possible because the Express runtime retains the full execution, including all intermediate results, of every plan in the current conversation. This data, stored as a dataflow graph, represents our conversational context: things which we’ve already discussed, and may want to later reference. Certain special functions, such as the(), can query that graph, searching for the expression which is being referenced.

In SHRDLU, these pronouns were resolved during its parse phase, which transformed utterances into its own version of a plan. Resolution, however, is not always determined by the grammatical structure of the utterance; sometimes we need to understand its semantics. Consider these two commands:

  • Put the red block beneath the green block, and the pyramid on top of it
  • Put the red block above the green block, and the pyramid on top of it

Common sense tells us that the pyramid should go on whichever block is above the other. To act on this common sense, SHRDLU had to abandon any meaningful separation of syntactic and semantic analysis, which explains, in part, why it was so hard to extend. In our system, resolution is driven by an entirely separate model, which uses syntactic heuristics where possible and domain-specific semantics where necessary. For most developers, however, it suffices to know that “it” and “that” translate into the().
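As a rough mental model for how a call like the[Block]() might behave, the Python sketch below keeps a flat history of results and returns the most recent value of the requested type. This is a deliberate simplification: the Context class and its methods are hypothetical, the real runtime stores a dataflow graph rather than a list, and, as noted above, actual resolution is driven by a separate model rather than by recency alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str


@dataclass(frozen=True)
class Box:
    name: str


class Context:
    """Records every value produced while executing plans, oldest first."""

    def __init__(self):
        self._results = []

    def record(self, value):
        """Store a result so later utterances can refer back to it."""
        self._results.append(value)
        return value

    def the(self, kind, predicate=lambda value: True):
        """Return the most recent value of the given type that satisfies the predicate."""
        for value in reversed(self._results):
            if isinstance(value, kind) and predicate(value):
                return value
        raise LookupError(f"no {kind.__name__} in the conversational context")
```

After recording a red cube, a box, and a blue block in that order, the(Block) returns the blue block, while the(Block, lambda b: "red" in b.name) reaches further back for the red cube. A recency rule like this would get at least one of the two red block and green block commands above wrong, which is why semantics have to enter the picture.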

Constraints

Notice that in the above plan we pass find_block a predicate with the criteria for the block we wish to find:

find_block((b: Block) => taller_than(b, held_block()))

This is because the user hasn’t told us which block they want; they have only provided the criteria for finding it. This is called an intensional description, as opposed to an extensional description, which specifies the actual entity or entities. In practice, every entity we reference in conversation is referenced intensionally; a reference to “Alice” would be translated into:

the[Person](p => p.name ~= "Alice")

where ~= means “similar to”. When executed, the() will try to find a person named Alice somewhere in the conversational history, but there’s no guarantee one exists. The user may assume that, given who they are, the system can figure out who they mean. Perhaps there’s a particular Alice that they work with, or someone in their family is named Alice. In either case, the user clearly thinks they’ve given us enough information, so we have to figure out what makes sense in the given context.

If the() fails to find a match in the conversational context, it will call a resolver function associated with the Person datatype. But how should a Person resolver, given a user-provided predicate, actually work? We can’t simply scan over a list of all the possible people and apply our predicate as a filter; that dataset lives elsewhere and is unlikely to be easily accessed. Because of both practical and privacy concerns, it will almost certainly be exposed via a service with access controls and a limited API.

Our resolver, then, must translate the predicate into one or more queries to backend services which provide information about people. To do that, we must stop thinking of it as a predicate and start thinking of it as a constraint.

Many developers have likely heard of SAT solvers, which, given constraints on one or more boolean values, will try to find satisfying assignments. Given a && !b, a solver will return a == true, b == false. Given a && !a, it will tell us that the constraint is unsatisfiable. Since a variety of problems can be mapped into this representation, SAT solvers are widely used. This capability is generalized by SMT solvers, which can solve more complex constraints on a wider variety of datatypes.
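For readers who have not used one, the contract of a SAT solver can be illustrated with a brute-force search in Python. Real solvers use far more sophisticated algorithms than enumerating every assignment; the function solve below is only a sketch of the input and output behavior described above.

```python
from itertools import product


def solve(variables, constraint):
    """Return the first assignment satisfying the constraint, or None if unsatisfiable."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if constraint(assignment):
            return assignment
    return None


# a && !b is satisfiable; a && !a is not.
satisfiable = solve(["a", "b"], lambda v: v["a"] and not v["b"])
unsatisfiable = solve(["a"], lambda v: v["a"] and not v["a"])
```

Here satisfiable is {"a": True, "b": False} and unsatisfiable is None.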

Neither kind of solver, however, has a way to specify “the value must correspond to an entity in a backend service.” Even if they did, we probably wouldn’t want to use it; we don’t want the solver to fire off dozens of queries for names similar to “Alice” to the backend service while searching through possible values. Only the domain developer building atop our dialogue system understands the capabilities and costs of their backend services. The query API for a service, for instance, might offer its own “similar to” operator. Their similarity metric, however, probably won’t reflect that some people use “Misha” and “Mikhail” interchangeably. The domain developer will have to maintain a balance between preserving the user’s intent and minimizing the number of requests they make per utterance.

Since we can’t fully interpret the constraint for the domain developers, we must provide them their own tools for interpretation. Domain functions which, like resolvers, interpret constraints are called controllers. In the current version of our system, controllers are typically written in TypeScript, since that language is likely to be a familiar and expressive way to write complex domain logic. Within the controller, predicates are transformed into constraint zippers, which allow developers to traverse, query, and transform constraints on complex datatypes. For each field and sub-field, domain developers can ask various questions: are there lower or upper bounds? What is an example of a satisfying value? Is that the only satisfying value? Does this value satisfy the constraint?

This last question is crucial, because we won’t always be able to encode the entire constraint in our query to the backend service. The set of results we get back may be too broad, and therefore must be post-filtered using the constraint. Conversely, operators which correspond to query operators in the service’s API, like ~=, can be configured as abstract named properties. Upon navigating to Person.name, we can look for an abstract property of ~=, and examine its argument’s zipper to construct our query.
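The general shape of such a controller can be sketched as follows. This is not the constraint zipper API, which is not shown in this post, and it is written in Python rather than TypeScript; the Clause type, the alias table, and fake_directory_search are all hypothetical stand-ins. The sketch treats the constraint as inspectable data, pushes the ~= clause on the name into backend queries, and post-filters the results with the full constraint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    team: str


@dataclass(frozen=True)
class Clause:
    field: str  # attribute of Person, e.g. "name"
    op: str     # "~=" for "similar to"; anything else is treated as equality
    value: str


# Hypothetical nickname table that a backend's similarity metric would not know.
ALIASES = {"misha": ["misha", "mikhail"], "mikhail": ["mikhail", "misha"]}


def name_variants(name):
    return ALIASES.get(name.lower(), [name.lower()])


def similar(query, name):
    """Assumed similarity: the name starts with the query or one of its aliases."""
    return any(name.lower().startswith(v) for v in name_variants(query))


def satisfies(person, clauses):
    """Does this value satisfy every clause of the constraint?"""
    for c in clauses:
        actual = getattr(person, c.field)
        ok = similar(c.value, actual) if c.op == "~=" else actual == c.value
        if not ok:
            return False
    return True


def fake_directory_search(name_prefix):
    """Stand-in for a backend service that only supports prefix search on names."""
    directory = [
        Person("Alice Smith", "Search"),
        Person("Alice Jones", "Sales"),
        Person("Mikhail Petrov", "Search"),
    ]
    return [p for p in directory if p.name.lower().startswith(name_prefix)]


def resolve_person(clauses, search=fake_directory_search):
    """Query the backend once per name variant, then post-filter with the full constraint."""
    name_clause = next((c for c in clauses if c.field == "name" and c.op == "~="), None)
    if name_clause is None:
        return []  # this sketch refuses to enumerate the whole directory
    seen, results = set(), []
    for variant in name_variants(name_clause.value):
        for person in search(variant):
            if person not in seen and satisfies(person, clauses):
                seen.add(person)
                results.append(person)
    return results
```

Asking for a person similar to “Alice” on the Search team issues one backend query and discards Alice Jones in the post-filter. Asking for “Misha” costs two queries, which is the kind of trade-off between fidelity and request count that only the domain developer can judge.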

Early AI researchers envisioned a world where knowledge had a singular representation and a singular repository. Instead, we live in a world where data, and the ability to interpret it, is fragmented and diffuse. As a result, our constraint solver must be unusually extensible, allowing developers to compose it with their own systems and domain expertise.

Revision

A major challenge in interpreting a user’s intent is everything they leave unsaid. Stripped of any context, much of what we say is ambiguous. To interpret “I’m headed to the bank,” we need to know whether the speaker is near a river. In linguistics, the study of how context confers meaning is called pragmatics. Our dialogue system, then, needs to provide tools for developers to easily specify domain-specific pragmatics.

For example, if in Outlook Mobile a user says, “reschedule my meeting with Alice to next week,” we can reasonably assume they mean an upcoming meeting, because almost everything we do in our calendar focuses on upcoming events. If we believed this was always true, we could simply take every user intension about an event and further constrain it to start in the future:

def add_pragmatics(predicate: Event => Boolean): Event => Boolean = {
  e => predicate(e) && e.start > now()
}

But what if the user wants to reschedule a past meeting that was cancelled? If we apply the above function to “reschedule yesterday’s meeting with Alice to next week,” the event will be constrained to be both yesterday and in the future; the constraint will be unsatisfiable. We can’t, then, simply mix our default assumptions into whatever the user provides; we have to allow them to be selectively overridden, just like any other default value. Fortunately, we have a solution which is general across all domains:

def add_pragmatics(predicate: Event => Boolean): Event => Boolean = {
  revise(
    e => e.start > now(), 
    predicate,
  )
}

In our system, revise is a powerful operator that, given two constraints a and b, will discard the parts of a which keep b from being meaningful, and conjoin the rest onto b. Consider a query for “yesterday’s meeting”, where we revise some basic pragmatics with the user’s intension:

revise(
  e => e.start > now() && e.attendees.contains(me()), 
  e => e.start.date == yesterday(),
)

Our default assumptions are that the event being referenced starts in the future and will be attended by the user. The first clause of those defaults, however, contradicts the user’s intension. The result of our revision, then, will consist of the second default clause and the user’s intension:

e => e.start.date == yesterday() && e.attendees.contains(me())

Simply looking for contradictions, however, isn’t enough. Consider a query for all the events since the year began:

revise(
  e => e.start > now() && e.attendees.contains(me()), 
  e => e.start > beginning_of_year()
)  

In this case, the user’s intension isn’t contradicted by our default assumptions, but it is implied by them. If an event starts in the future, it necessarily occurs after the year began. If we don’t drop e.start > now(), we will effectively ignore what the user said.

Since both contradiction and implication are concerned with intrinsic properties of a datatype (as opposed to extrinsic properties like “this corresponds to an entity in a backend service”), our system can handle the revision process on its own. Developers can simply focus on defining the appropriate pragmatics for their domain.
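A toy version of this behavior can be written in Python if we restrict ourselves to a very small constraint language. Everything here is an assumption made for illustration: clauses are either open numeric ranges or membership tests, time is measured in days relative to now (so “in the future” is start > 0 and “yesterday” is approximated as the range from -1 to 0), and the real revise operator is certainly more general than this.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """The field must lie strictly between lo and hi."""
    field: str
    lo: float = -math.inf
    hi: float = math.inf


@dataclass(frozen=True)
class Contains:
    """The field (a collection) must contain the item."""
    field: str
    item: str


def contradicts(a, b):
    """True if no value can satisfy both range clauses."""
    return max(a.lo, b.lo) >= min(a.hi, b.hi)


def implies(a, b):
    """True if every value satisfying range a also satisfies range b."""
    return b.lo <= a.lo and a.hi <= b.hi


def revise(defaults, user):
    """Drop default clauses that contradict or imply a user clause, then keep the rest."""
    kept = []
    for d in defaults:
        rivals = [
            u for u in user
            if isinstance(d, Range) and isinstance(u, Range) and u.field == d.field
        ]
        if any(contradicts(d, u) or implies(d, u) for u in rivals):
            continue
        kept.append(d)
    return list(user) + kept


# Default pragmatics: the event starts in the future and the user attends it.
DEFAULTS = [Range("start", lo=0.0), Contains("attendees", "me")]

yesterdays_meeting = revise(DEFAULTS, [Range("start", -1.0, 0.0)])
since_year_began = revise(DEFAULTS, [Range("start", lo=-100.0)])
```

In both results the default start > now clause is gone and the attendee clause survives: in the first case because the default contradicts the user’s range, in the second because the default implies it and would otherwise swallow what the user asked for.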

The existence of a revision operator, combined with the fact that users speak intensionally, also means that we can give users the ability to tweak and build upon what they’ve already said.

Consider the utterance “cancel my meeting with Alice.” If the user and Alice work on the same team, it’s likely they have more than one upcoming meeting together. We can guess at which one they mean, but before actually cancelling the meeting we will show them a description of the event and ask for confirmation.

Typically, confirmation involves giving the user a choice between “OK” and “cancel;” either we did exactly what they wanted, or they need to start over. Revision, however, means we don’t need to start over. If the user follows up “cancel my meeting with Alice” with “I meant the one-on-one,” we’ll revise the first intension with the second, and look for a one-on-one with Alice.

This is enormously freeing for the user, because it means they don’t need to fit everything they want into a single, monolithic utterance. This is akin to the difference between batch and interactive computing; users can try things, see what happens, and quickly build upon their successes.

This is also enormously freeing for the developer, because it means they can afford to get things wrong. We provide the best tools we can to help developers interpret the user’s intent, but the cost of misinterpretation is small. In the worst case, the user will be forced to provide incrementally more information.

Final Thoughts

Wherever possible, business logic should be described by code rather than training data. This keeps our system’s behavior principled, predictable, and easy to change. Our approach to conversational interfaces allows them to be built much like any other application, using familiar tools, conventions, and processes, while still being able to take advantage of cutting-edge machine learning techniques.

When revisiting ideas from this earlier era of research, however, we must be careful; used wholesale, they’re likely to send us down the same path as the people who first proposed them. Sometimes, as with plans, we have to make minor modifications. Sometimes, as with constraints, we have to acknowledge complexities that weren’t even imagined by early researchers. Sometimes, as with revision, we have to create something entirely novel.

Doing this well requires a team with a wide variety of interests and expertise. In addition to people with expertise in computational linguistics, we’re also looking to hire people with backgrounds in programming language runtimes, constraint solvers, and SDK design. If this sounds like you, and everything described above sounds like something you’d want to work on, let us know.