Speech Input Technology - Architecture and Driver Support

Updated: December 4, 2001
On This Page
The Human Voice as a PC interface
Why Speech Technology is Important
How Speech Technology Meets User Needs
How Speech Technology Works
Current Issues for Speech Technology
Microsoft Speech API 5.0
Microsoft Windows and Speech Technology
Guidelines for Hardware Manufacturers
Industry Standards and Activity

The Human Voice as a PC interface

Speech technology, after many years of research and development, is close to becoming a practical, mainstream way for users to interact with PCs. As users become familiar with speech-enabled applications, they will increasingly demand PCs and other devices that support speech technology.

Through the Speech Application Programming Interface (SAPI) 5.0, Microsoft supports speech technology in the Microsoft Windows XP operating system, the Microsoft Office XP suite, and other Microsoft applications.

System requirements have been well defined to assure a satisfactory user experience with speech technology. However, work remains to define specifications for microphone quality.

Why Speech Technology is Important

Businesses benefit from speech technology because it can reduce support costs for applications such as routine customer service inquiries. Speech technology can provide an alternative means of input for employees with carpal tunnel syndrome or other disabilities that make mouse and keyboard input impractical. Speech can also supplement input-intensive applications such as computer-aided design (CAD) programs, allowing the user to speak commands and dimensions while drawing with the mouse.

As a means to control and interact with a PC and applications, speech technology gives users several advantages:

Simplicity and convenience

Improved PC accessibility for people who are unable to use a keyboard or view a display

Voice support for Internet applications, such as multiplayer games

A simple and convenient interface for platforms such as Pocket PC, Tablet PC, and hand-held devices

Support for new applications such as AutoPC, which allows a driver to ask for and listen to spoken directions while both hands remain on the steering wheel

How Speech Technology Meets User Needs

Speech technology gives PC users three primary functions:

Command and Control: Use voice commands to navigate menus, toolbars, and dialogs in applications.

Dictation: Dictate directly into applications through a microphone attached to the PC, without having to use the keyboard or other input device.

Text-to-Speech (TTS): The computer reads typed text in a synthesized voice by way of the sound card and a headset or speakers attached to the PC.

How Speech Technology Works

Speech recognition engines are software drivers that convert the acoustical signal to a digital signal and deliver recognized speech as text to the application. Most speech recognition engines support continuous speech, meaning a user can speak naturally into a microphone at the speed of most conversations.

Software drivers called synthesizers, or text-to-speech voices, perform the tasks of converting text and generating spoken language. A text-to-speech voice generates sounds similar to those created by human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape, and tongue position.

Although easy to understand, the voice produced by synthesis technology tends to sound less natural than a voice reproduced by a digital recording of a human speaker.

Computer-based speech processing technology falls into two broad categories:

Speech Recognition: The ability to accept and process spoken language

Speech Synthesis: The ability to respond to a user by generating or synthesizing spoken language from text

Speech recognition, or speech-to-text, involves:

Capturing and digitizing the sound waves of a user's voice

Converting the sound waves to basic language units, or phonemes

Constructing words from phonemes

Contextually analyzing words to ensure correct interpretation for words that sound alike (such as write and right)
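
The last step, contextual analysis, can be illustrated with a toy example. The sketch below is not how any particular engine works. It assumes a tiny hand-written table of word-pair counts (the numbers and the names PAIR_COUNTS, HOMOPHONES, pick_homophone, and transcribe are invented for illustration) and picks whichever spelling is more plausible after the preceding word.

```python
# Toy illustration of contextual analysis for words that sound alike.
# The counts below are invented for the example, not measured data.
PAIR_COUNTS = {
    ("to", "write"): 50,
    ("to", "right"): 2,
    ("turn", "right"): 40,
    ("turn", "write"): 0,
}

# Words the recognizer cannot tell apart by sound alone.
HOMOPHONES = {"write": ["write", "right"], "right": ["write", "right"]}


def pick_homophone(previous_word, heard_word):
    """Return the candidate spelling that most often follows previous_word."""
    candidates = HOMOPHONES.get(heard_word, [heard_word])
    return max(candidates, key=lambda w: PAIR_COUNTS.get((previous_word, w), 0))


def transcribe(words):
    """Resolve homophones left to right, using the previous resolved word."""
    result = []
    for word in words:
        previous = result[-1] if result else ""
        result.append(pick_homophone(previous, word))
    return result


print(transcribe(["i", "want", "to", "right", "a", "letter"]))
print(transcribe(["turn", "write"]))
```

A production engine would draw on far larger statistical models, but the principle is the same: the surrounding words decide between candidates that sound identical.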

To operate with acceptable levels of speed and accuracy, speech recognition engines use two primary tools:

A grammar that defines recognized words. For command and control mode, the grammar can be limited to the list of available commands. For continuous dictation, the grammar must encompass nearly the entire dictionary.

A speaker profile. This allows the speech recognition engine to accommodate the user's distinct speech patterns and accent.
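
To show why a small grammar helps in command and control mode, here is a minimal sketch that maps a possibly misheard phrase onto the closest entry in a short command list. The command list and the match_command helper are hypothetical, and the fuzzy matching uses Python's standard difflib module as a stand-in for real acoustic scoring.

```python
import difflib

# Hypothetical command-and-control grammar: only these phrases are valid.
COMMAND_GRAMMAR = ["file new", "file open", "file save", "edit copy", "edit paste"]


def match_command(heard, grammar=COMMAND_GRAMMAR, cutoff=0.6):
    """Map a possibly misheard phrase to the closest grammar entry, or None."""
    matches = difflib.get_close_matches(heard.lower(), grammar, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_command("file safe"))               # a near miss snaps to a command
print(match_command("open the pod bay doors"))  # nothing in the grammar is close
```

Because the search space is only five phrases, a near miss can be corrected cheaply. Dictation has no such shortcut, which is why its grammar must cover most of the dictionary.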

Speech synthesis, or text-to-speech, is the process of converting text into spoken language by:

Breaking the words into phonemes

Analyzing text for special handling of numbers, currency, inflection, and punctuation

Generating the digital audio for playback
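
The special handling of numbers and currency can be sketched as a text normalization pass that runs before any audio is generated. The example below is a simplified assumption, not the method of any specific synthesizer: it only handles whole numbers from 0 to 99 and whole-dollar amounts, and the function names number_to_words and normalize are made up for illustration. Phoneme conversion and audio generation are out of scope here.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0 to 99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Rewrite '$N' and bare 'N' tokens as words so they can be pronounced."""
    def dollars(match):
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    # Currency first, so the digits are not consumed by the plain-number rule.
    text = re.sub(r"\$(\d{1,2})\b", dollars, text)
    # Longer numbers are left untouched by this sketch.
    return re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)


print(normalize("Pay $5 for 12 items"))
```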

Current Issues for Speech Technology

Today, the user experience of speech technology too often means the frustration of slow response and inaccurate recognition. This experience is created in part by current issues related to natural-language recognition, use of CPU resources, and microphone quality. Microsoft is working with vendors to resolve these issues and enable speech technology to realize its full potential as a user interface.

Natural Language
The grammar for command and control built into most speech recognition applications today is limited largely to the commands on the toolbar menu. While this approach is simple for a user to understand and learn, it does not reflect the user’s natural choice of words and phrases.

With natural language support, a user can say "I want to send a letter to Mom" with the result that the Microsoft Word application opens a new document in the letter template starting with "Dear Mom." Today, the user must say, "File... New... Letters and Faxes" and related commands.

Processor Speed
Speech recognition makes intensive use of a PC’s processing resources, especially during dictation. Because the speech recognition engine begins to transcribe the audio input as soon as it detects the first word or two, a slower PC processor increases the time elapsed before recognition begins. In some cases, if the processor is too slow or system memory is inadequate, recognition will occur long after the user has finished speaking. This delay impacts the usability of speech recognition.

To reduce recognition latency, many speech recognition engines constrain the amount of resources allocated to speech processing. However, this constraint has the potential of reducing accuracy because the speech recognition engine will not search through a sufficiently large set of possible transcriptions, making it less likely the engine will find the correct transcription.
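
The trade-off between search effort and accuracy can be seen in a toy beam search. Beam pruning is one common way to cap the work a recognizer does; the source does not say which technique any given engine uses, so treat this as an assumption. All scores, word slots, and names (SLOTS, PAIR_SCORES, beam_search) are invented for illustration.

```python
# Per-slot acoustic scores (log-likelihood style, higher is better).
# All numbers are invented for illustration.
SLOTS = [
    {"right": -0.8, "write": -1.0},
    {"a": -0.1},
    {"letter": -0.2},
]

# Language-model score for a word given the previous word.
PAIR_SCORES = {
    ("write", "a"): -0.1,
    ("right", "a"): -1.5,
    ("a", "letter"): -0.1,
}


def beam_search(slots, beam_width):
    """Return (words, score) of the best hypothesis that survives pruning."""
    beam = [([], 0.0)]
    for candidates in slots:
        extended = []
        for words, score in beam:
            for word, acoustic in candidates.items():
                prev = words[-1] if words else None
                pair = PAIR_SCORES.get((prev, word), 0.0)
                extended.append((words + [word], score + acoustic + pair))
        # Keep only the top beam_width hypotheses; the rest are discarded.
        extended.sort(key=lambda h: h[1], reverse=True)
        beam = extended[:beam_width]
    return beam[0]


print(beam_search(SLOTS, beam_width=1)[0])  # narrow beam
print(beam_search(SLOTS, beam_width=2)[0])  # wider beam
```

With a beam width of 1, "write" is discarded after the first slot because "right" scores slightly better acoustically, so the search can only end at "right a letter". A width of 2 keeps both candidates alive long enough for the language-model score to favor "write a letter". The wider beam costs more computation per word, which is the latency cost described above.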

Microphone Quality
The audio quality of the microphone used for speech can have a significant effect on recognition accuracy. Industry experts have found that if the accuracy of speech recognition falls below 95 percent, users will not use speech as a PC interface. Although many microphones are acceptable for telephone applications and simple recording, they may not provide acceptable quality for speech recognition on PCs.

Speech technology information from Microsoft
http://www.microsoft.com/speech/speech2007/default.mspx

Microsoft Speech API 5.0

SAPI 5.0 gives developers a rich set of speech services for building high-performance applications that run on desktop, mobile, and server platforms.

The speech services in SAPI 5.0 are enabled through the Component Object Model (COM), the Windows standard specification for software interoperability. The SAPI 5.0 speech services are compatible with and can be leveraged by many third-party speech recognition engines and synthesizers, programming languages, and tools.

SAPI 5.0 is a software layer that allows speech-enabled applications to communicate with both speech recognition and TTS engines. SAPI 5.0 includes an API and a Device Driver Interface (DDI). Applications communicate with SAPI 5.0 via the API layer; speech engines communicate via the DDI layer.

With SAPI 5.0, a speech-enabled application and a speech recognition engine do not communicate directly with each other; all communication is done via SAPI. In addition, SAPI takes responsibility for a number of functions in a speech system, such as:

Controlling audio input, whether from a microphone, files, or custom audio source.

Converting audio data to a valid speech engine format.

Loading grammar files, whether created dynamically or from memory, a URL, or a file. With SAPI 5.0, developers can build a grammar that includes a variety of natural-language phrases that might be spoken by users.

Resolving grammar imports and editing.

Compiling standard SAPI XML grammar format, converting custom grammar formats, and parsing semantic tags in results.

Sharing speech recognition across multiple applications using the shared engine; all coordination between the shared engine and applications is performed by SAPI.

Returning results and other information to the application and interacting with its message loop or other notification method. This return of results allows an engine to have a simple threading model because SAPI 5.0 performs much of the thread handling.

Storing audio and serializing results for later analysis.

Ensuring that applications do not cause errors, preventing applications from calling the speech engine with invalid parameters, and dealing with applications that hang or crash.

The speech recognition engine is responsible for:

Using SAPI grammar interfaces and loading dictation.

Performing speech recognition.

Polling SAPI to learn about grammar and state changes.

Generating recognitions and other events that provide information to the application.
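
The division of labor described above, where the application and the engine never talk to each other directly, can be sketched with a small mediating layer. This is a conceptual model only. The classes EchoEngine and SpeechLayer and all of their methods are hypothetical and do not correspond to actual SAPI 5.0 interfaces, which are COM-based.

```python
class EchoEngine:
    """Stand-in engine: 'recognizes' by decoding bytes as text. Hypothetical."""

    required_format = "pcm16"

    def recognize(self, audio_bytes):
        return audio_bytes.decode("ascii").strip().lower()


class SpeechLayer:
    """Hypothetical middle layer between applications and one shared engine."""

    def __init__(self, engine):
        self._engine = engine
        self._listeners = []

    def subscribe(self, callback):
        """Applications register a callback instead of calling the engine."""
        self._listeners.append(callback)

    def submit_audio(self, audio_bytes, audio_format):
        # Reject invalid parameters before they reach the engine.
        if not isinstance(audio_bytes, (bytes, bytearray)) or not audio_bytes:
            raise ValueError("audio must be non-empty bytes")
        # Convert to the engine's format when the source format differs.
        if audio_format != self._engine.required_format:
            audio_bytes = self._convert(audio_bytes, audio_format)
        text = self._engine.recognize(bytes(audio_bytes))
        # Deliver the result to every subscribed application.
        for callback in self._listeners:
            callback(text)
        return text

    def _convert(self, audio_bytes, audio_format):
        # Placeholder: real format conversion is out of scope for this sketch.
        return audio_bytes


received = []
layer = SpeechLayer(EchoEngine())
layer.subscribe(received.append)                         # first application
layer.subscribe(lambda text: received.append(text.upper()))  # second application
layer.submit_audio(b"File New\n", "pcm8")
print(received)
```

Both subscribers receive the same recognition result from one engine, and neither holds a reference to the engine itself. That is the property that lets a layer of this kind share recognition across applications and shield the engine from bad input.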

Third-party SAPI 5.0-compliant products: http://www.microsoft.com/speech/evaluation/thirdparty/applications.mspx

Microsoft Windows and Speech Technology

Because SAPI 5.0 is included in Windows XP, compliant applications do not need to distribute the API. Developers must install the SAPI 5.0 SDK to develop applications, but users do not need to install SAPI in order to use a speech-enabled application. This support allows applications to run across all devices, from hand-held computers to servers.

Microphones must meet the Windows Logo Program requirement, "Headset microphone used for speech recognition for systems that meets performance requirements," as defined in Microsoft Windows Logo Program System and Device Requirements, Version 2.0.

Windows XP will also offer this support:

An English TTS voice for Narrator, an accessibility feature supported in Windows 2000

Support for third-party speech-recognition applications that provide recognition engines compliant with SAPI 5.0

Microsoft Office XP includes SAPI 5.0 speech recognition for systems that meet the minimum hardware requirements listed below. For other speech applications that use SAPI 5.0, these requirements will help ensure a positive user experience:

400 MHz or higher processor

128 MB RAM

High-quality microphone

SAPI 5.0 SDK: http://www.microsoft.com/speech/

Guidelines for Hardware Manufacturers

Tools and sample code in the SAPI 5.0 SDK help software developers enable speech in their applications.

Without SAPI 5.0, application developers must know the specifics of interfacing with each speech engine and write separate code for each interface. With SAPI 5.0, developers simply write to the API and can access any supported speech engine without modifications to the application code.

Designers creating PC systems targeted as speech-enabled platforms should include the following components to meet the performance expectations of users:

Faster processors. Although 400 MHz is a minimum requirement for speech support, processors with double or triple that clock speed can make a dramatic difference in speech recognition speed and accuracy.

Memory of 128 to 256 MB RAM. More RAM helps to minimize paging while dictated words are cached.

Hard disk drives with spin rates of 7200 RPM or higher. Fast drive speeds will increase recognition speed for dictation when paging is necessary during caching.

High-quality headsets. To ensure the speech engine can accurately understand spoken words, select headsets that meet the requirements defined in Microsoft Windows Logo Program System and Device Requirements, Version 2.0.

Windows Logo Program Requirements: http://www.microsoft.com/whdc/winlogo/default.mspx

Industry Standards and Activity

Microsoft is working with other vendors to develop guidelines for hardware that supports speech recognition in applications, such as precise criteria for array microphones.

Speech Information for Windows Platforms
http://www.microsoft.com/speech/
http://www.microsoft.com/whdc/device/audio/speech/default.mspx

