The Truth About Multimodal Interaction


August 2003

Susan L. Hura, PhD
Head of User Experience
&
Ron Owens
Director of Professional Services

 

 

 

27 July 2001

 


Multimodal interaction with computers is often portrayed as somewhat magical.  We’ve all seen advertisements set in the not-too-distant future where people seamlessly interact with their computers using voice, touch screens, and various handheld devices.  Those people in the ads always know just what to say and where to click, and the computer always understands.  Unfortunately, human-computer interaction today is much less straightforward and more prone to errors.  How do we move from today’s sometimes awkward and fragmented reality to the intuitive, cohesive future of multimodal communication with computers?  In this article and an upcoming MSDN webcast, we explore the current state of our knowledge about multimodal interactions to help map the course from today’s reality to a multimodal future. 

Just like Star Trek

Universal access, any time, anywhere, using any device.  Smart systems that allow users to communicate their needs naturally, with no constraints on the form of their input.  These are common elements in media portrayals of human-computer interaction in the future.  Many of us hold an image in our minds of Captain Picard issuing orders to the Enterprise computer in much the same way he would speak to his crew: ‘Computer, find Mr. Data!’  While this is a desirable vision of human-computer interaction, such images set an unrealistic expectation for how people will be able to communicate with computers in the foreseeable future.  A number of significant technical and socio-cultural hurdles must be overcome before any of us will have Picard's ease and confidence with multimodal interactions.

Technical Hurdles

The view of multimodal interaction coming from marketing departments may be overly optimistic.  However, the vision of multimodal interaction coming from engineering and development teams can often be quite limited.  In these more concrete visions, the term multimodal often means nothing more than adding speech recognition capabilities to an existing website.  There are two major issues with this vision.  First, speech-enabling a GUI is only one facet of true multimodal interaction.  Second, even this limited vision is sometimes portrayed in an overly simplistic manner.  Technical challenges abound for making multimodal interaction work.

  • “Adding Speech” Is Not an Effective Strategy.  Adding speech to a website or piece of software designed for unimodal GUI interaction simply doesn’t work.  Multimodal interaction is a new and different way of interacting with computers.  Multimodal interaction has all the complexities of the individual modes of interaction plus the issues of choosing, combining, and sequencing the input from individual modes.
  • Multimodal Input Is More Complex Than “Speak and Point.”  It is often assumed that in a GUI-plus-speech interface, users will use speech to issue commands or state their requests, and will use a pointing device (mouse or pen) to indicate the object or location of their request.  A typical example is a user who says, “I need directions to get here,” where “here” is accompanied by pointing to a location on a map.  The reality is that users are not nearly as uniform in their responses or as limited in their use of the different modalities.
  • Content of Multimodal Input Is Usually Not Redundant Between Modes.  In the example above, speech and gesture are produced simultaneously and reinforce one another.  The location term “here” confirms that the user is indeed intending to refer to a particular location on a map, specifically the location indicated by the pointing gesture.  Early multimodal research, however, shows that users generally do not refer to the same things in two different modes.  Instead, speech gives one piece of information and gesture provides another.  This means we have to be able to process the data in each mode as stand-alone input, rather than depending on one mode to reinforce the other.
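
The points above can be made concrete with a small sketch.  The Python below is a minimal illustration under stated assumptions, not a description of any real toolkit: the InputEvent type, the fuse function, the "select" intent for a bare tap, and the 1.5-second pairing window are all hypothetical names and values introduced here.  Each event is interpreted on its own, and a spoken request borrows a location from a pointing gesture only when the speech leaves the location unstated and a gesture occurs close to it in time, whichever came first.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical event type: every name in this sketch is illustrative.
@dataclass
class InputEvent:
    mode: str                       # "speech" or "pointer"
    time: float                     # seconds since the session started
    intent: Optional[str] = None    # e.g. "directions" (speech only)
    location: Optional[str] = None  # a place named in speech or pointed at

PAIRING_WINDOW = 1.5  # assumed value, in seconds

def fuse(events):
    """Turn single-mode events into requests.

    Speech is treated as stand-alone input. A pointer event is borrowed
    only to fill a missing location, and only if it falls within
    PAIRING_WINDOW of the utterance (before or after it).
    Spoken requests are returned first, then unpaired pointer events.
    """
    events = sorted(events, key=lambda e: e.time)
    used = set()
    requests = []
    for ev in events:
        if ev.mode != "speech":
            continue
        location = ev.location
        if location is None:
            # Look for the nearest unused pointer event inside the window.
            best = None
            for j, other in enumerate(events):
                if other.mode != "pointer" or j in used:
                    continue
                gap = abs(other.time - ev.time)
                if gap <= PAIRING_WINDOW and (best is None or gap < best[0]):
                    best = (gap, j)
            if best is not None:
                used.add(best[1])
                location = events[best[1]].location
        requests.append({"intent": ev.intent, "location": location})
    # Pointer events that were not paired stand alone (assumed "select").
    for j, ev in enumerate(events):
        if ev.mode == "pointer" and j not in used:
            requests.append({"intent": "select", "location": ev.location})
    return requests

if __name__ == "__main__":
    session = [
        InputEvent("pointer", 0.8, location="museum"),
        InputEvent("speech", 1.0, intent="directions"),
        InputEvent("speech", 5.0, intent="directions", location="airport"),
        InputEvent("pointer", 9.0, location="hotel"),
    ]
    for request in fuse(session):
        print(request)
```

In this sketch the first utterance is completed by the nearby tap, the second is handled from speech alone, and the last tap is handled from the pointer alone.  A real system would also have to cope with recognition errors and with far more varied user behavior than a fixed time window can capture.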

Socio-Cultural Hurdles

Users come into their interactions with automated systems with a set of terminology, metaphors, and organizational structures already in place.  That is, users have a mental model of the domain and how to accomplish their goals.  Tapping into and conforming to the user’s mental model is a challenge for any automated system (even single-mode systems).  The challenge is amplified for systems based on unfamiliar technologies.  Most people today have more experience with multimodal interaction from watching Star Trek than from their daily lives.  Thus, even if we surmount the technical hurdles, multimodal interaction will be successful only if people are comfortable with the technology and want to use it.  This will happen only if we can set users’ expectations appropriately and provide them with a satisfying experience.

  • Setting User Expectations.  Interaction with personal computers is familiar and comfortable for many people, but speech technology is still new and unknown to many.  Users don’t know what to say or how they must speak to be understood.  Multimodal interaction adds the uncertainty surrounding the choice between input modalities (should I say it or click it?) and sequencing of input across modalities (do I talk then click, or click then talk, or click and talk together?).  Uncertainty leads to greater variability in user input, even for the same user, thus increasing the technical challenges.
  • Creating Positive User Experiences.  User experience is the driving force behind customer satisfaction.  Positive user experiences occur when users feel they are in control of their interaction with an automated system and that the system supports, rather than hinders, accomplishing their goals.  To provide a satisfying user experience, multimodal interface designers must create systems that conform to users’ mental models and meet their expectations about the interaction and the domain.

The Good News

Fortunately, we can focus on some good news about multimodal interaction.  Tremendous benefits are possible for both users and businesses that move forward cautiously on the new frontier of multimodal interaction. 

  • Users Have a Strong Preference for Multimodal Interaction.  Users seem to know that their needs would be better met if they had more choices in interacting with automated systems.  Moreover, preliminary studies of multimodal interaction show that users have a natural tendency to choose the “right” modality for their input.  That is, users will tend to point to indicate direction rather than saying error-prone things like East/West coordinates.  Consequently, multimodal interaction may enable applications that are more robust than the individual contributing technologies.
  • Multimodal Input May Often Be Simpler Than Expected.  Language spoken in multimodal interactions tends to be simpler than human-to-human language, showing simpler syntax, fewer words per utterance, and less ambiguity.  Additionally, much input to multimodal systems may be unimodal because only some of the information users want to communicate needs to be expressed across multiple modes. 
  • Multimodal Systems Are More Flexible.  This flexibility allows users to select the modality that best fits the situation, thus reducing user errors in multimodal input.  When errors do occur, multimodal systems provide the means for more effective error recovery.  The biggest benefit of flexibility is its potential to increase access to automated systems for more individuals in more varied environments.
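
One way to picture the error-recovery point above is a system that, when it is unsure of what it received, invites the user to switch modes.  The Python below is an illustration only: the recovery_prompt function, the 0.5 confidence threshold, and the prompt wording are hypothetical and not drawn from any particular product.

```python
# Hypothetical mapping from the mode that failed to a suggested alternative.
ALTERNATIVES = {
    "speech": "tap or click your choice",
    "pointer": "say your choice",
}

def recovery_prompt(mode, confidence, threshold=0.5):
    """Return a re-prompt offering another mode, or None if none is needed.

    `confidence` is assumed to be a recognizer score between 0 and 1;
    the threshold value is an arbitrary placeholder.
    """
    if confidence >= threshold:
        return None  # input was understood well enough; no recovery needed
    suggestion = ALTERNATIVES.get(mode, "try again")
    return f"Sorry, I didn't catch that. You can also {suggestion}."

if __name__ == "__main__":
    print(recovery_prompt("speech", 0.9))
    print(recovery_prompt("speech", 0.2))
```

A unimodal system can only ask the user to repeat the same kind of input; offering a second mode gives the user a different route to the same goal.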

Next Steps

Multimodal interaction offers an exciting vision for the future of human-computer communication.  The technical challenges of making multimodal interaction work are surmountable, especially for systems in limited domains.  The issues associated with setting users’ expectations and providing a positive user experience may prove to be more significant challenges.  There is much about the basic psychology of interacting across multiple modalities that is not yet known, and may not become evident until the first widely deployed multimodal systems are in use.  Current and near-term future work in telematics, smart phones, and speech-enabled PDAs will provide the earliest real-world usage data about how people interact with multimodal systems.  Focusing on the users of multimodal systems is the best way to ensure that we successfully reach the Star Trek vision of the future.