Multimodal interaction with computers is often portrayed as somewhat magical. We’ve all seen advertisements set in the not-too-distant future where people seamlessly interact with their computers using voice, touch screens, and various handheld devices. Those people in the ads always know just what to say and where to click, and the computer always understands. Unfortunately, human-computer interaction today is much less straightforward and more prone to errors. How do we move from today’s sometimes awkward and fragmented reality to the intuitive, cohesive future of multimodal communication with computers? In this article and an upcoming MSDN webcast, we explore the current state of our knowledge about multimodal interactions to help map the course from today’s reality to a multimodal future.
Universal access, any time, anywhere, using any device. Smart systems that allow users to communicate their needs naturally, with no constraints on the form of their input. These are common elements in media portrayals of human-computer interaction in the future. Many of us hold an image in our minds of Captain Picard issuing orders to the
The view of multimodal interaction coming from marketing departments may be overly optimistic. However, the vision of multimodal interaction coming from engineering and development teams can often be quite limited. In these more concrete visions, the term multimodal often means nothing more than adding speech recognition capabilities to an existing website. There are two major issues with this vision. First, speech-enabling a GUI is only one facet of true multimodal interaction. Second, even this limited vision is sometimes portrayed in an overly simplistic manner. Technical challenges abound in making multimodal interaction work.
· “Adding Speech” Is Not an Effective Strategy. Adding speech to a website or piece of software designed for unimodal GUI interaction simply doesn’t work. Multimodal interaction is a new and different way of interacting with computers. Multimodal interaction has all the complexities of the individual modes of interaction plus the issues of choosing, combining, and sequencing the input from individual modes.
· Content Of Multimodal Input Is Usually Not Redundant Between Modes. In the example above, speech and gesture are produced simultaneously and reinforce one another. The location term here confirms that the user is indeed intending to refer to a particular location on a map—specifically the location indicated by the pointing gesture. Early multimodal research, however, shows that users generally do not refer to the same things in two different modes. Instead, speech gives one piece of information and gesture provides another. This means we have to be able to process the data in each mode as stand-alone input, rather than depending on one mode to reinforce the other.
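The point about complementary input can be made concrete with a small sketch. The Python below is an illustration only, not taken from any particular multimodal system: the `SpeechInput` and `GestureInput` types, the `fuse` function, and the two-second pairing window are all assumptions chosen for the example. Each mode is interpreted on its own first. The combining step then fills in a location from a pointing gesture only when speech did not supply one, and pairs the two by how close they are in time.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]

@dataclass
class SpeechInput:
    """Stand-alone result of the speech mode (hypothetical structure)."""
    time: float                       # seconds since the session started
    action: str                       # e.g. "zoom"
    location: Optional[Point] = None  # set only if the user named a place

@dataclass
class GestureInput:
    """Stand-alone result of the pointing mode (hypothetical structure)."""
    time: float
    point: Point                      # map coordinates that were pointed at

def fuse(speech: SpeechInput,
         gestures: List[GestureInput],
         window: float = 2.0) -> Tuple[str, Optional[Point]]:
    """Combine one spoken command with the pointing gesture nearest in time.

    The 2-second window is an assumed value, not an empirical one.
    """
    # Speech already carries a location, so no gesture is needed.
    if speech.location is not None:
        return speech.action, speech.location

    # Otherwise look for gestures made close in time to the utterance.
    nearby = [g for g in gestures if abs(g.time - speech.time) <= window]
    if not nearby:
        # Nothing to combine: the caller has to ask the user for a location.
        return speech.action, None

    closest = min(nearby, key=lambda g: abs(g.time - speech.time))
    return speech.action, closest.point
```

For instance, a spoken "zoom" at 10.0 seconds together with a pointing gesture at 10.5 seconds yields the action "zoom" at the pointed-to coordinates, while a gesture made seven seconds earlier is ignored. Neither mode is treated as confirmation of the other; each contributes a different part of the command.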
Users come into their interactions with automated systems with a set of terminology, metaphors, and organizational structures already in place. That is, users have a mental model of the domain and how to accomplish their goals. Tapping into and conforming to the user’s mental model is a challenge for any automated system (even single-mode systems). The challenge is amplified for systems based on unfamiliar technologies. Most people today have more experience with multimodal interaction from watching Star Trek than from their daily lives. Thus, even if we surmount the technical hurdles, multimodal interaction will be successful only if people are comfortable with the technology and want to use it. This will happen only if we can set users’ expectations appropriately and provide them with a satisfying experience.
Fortunately, we can focus on some good news about multimodal interaction. Tremendous benefits are possible for both users and businesses that move forward cautiously on the new frontier of multimodal interaction.
Multimodal interaction offers an exciting vision for the future of human-computer communication. The technical challenges of making multimodal interaction work are surmountable, especially for systems in limited domains. The issues associated with setting users’ expectations and providing a positive user experience may prove to be more significant challenges. There is much about the basic psychology of interacting across multiple modalities that is not yet known, and may not become evident until the first widely deployed multimodal systems are in use. Current and near-term future work in telematics, smart phones, and speech-enabled PDAs will provide the earliest real-world usage data about how people interact with multimodal systems. Focusing on the users of multimodal systems is the best way to ensure that we successfully reach the Star Trek vision of the future.