| Hardware Considerations in Natural Language Input | |
| Requirements for Speech Hardware | |
| Challenges for Speech Applications | |
| Resources |
The Microsoft Windows XP operating system and the Microsoft Office XP productivity suite make great advances in delivering consumer applications that enable natural-speaking command and control.
Natural speech interaction is important for developing new capabilities and new markets for PC-architected products, including:
| • | Real-time voice-enabled chat and messaging over the web |
| • | Voice support for Internet applications, such as multi-player games |
| • | Voice command and control of applications, such as Microsoft Office XP voice-enabled productivity applications |
| • | Media-rich entertainment scenarios including control of Windows Media Player |
| • | Enabling E-Tablet or portable devices with Natural Language Input |
| • | IP Telephony applications |
Minimum Support for Speech on PCs
| • | Minimum memory and processor support: |
| • | Minimum microphone support: as defined in PC 2001 System Design Guide |
Advances to Further Speech on PCs
To further the experience of natural speech, it will be necessary to address hardware and software advancements in the following areas:
| • | Require Digital I/O for speech input devices - choose USB microphones |
| • | Deliver array microphone with speech-enabled PCs |
| • | Develop support for: |
| • | Design, develop and test to address: |
To advance the platform for natural speech input, the complexity of multiple signal paths must be reduced. Duplicate analog and digital signal paths impose an undue workload on the software, so a digital-input-only approach is needed for a cohesive hardware and software system. Universal Serial Bus (USB) provides the best delivery mechanism for a digitally enabled microphone, so future hardware requirements will specify USB-only microphones.
With the advent of the "Entertainment PC," the PC will be used in more social situations. The need to capture either a single speaker or multiple speakers in such settings calls for adding a microphone array to the hardware. It is recommended that the microphone array be added to the monitor for optimal results. Microsoft will provide guidelines for the design of the injection-molded microphone casing to reduce diffraction of the wavefront as it arrives at the microphone capsule.
To support microphone arrays, Microsoft will provide signal processing methodologies for enhanced speech-engine performance in future releases of the operating system. These enhancements will include Adaptive Beamforming and De-Reverberation algorithms. The Adaptive Beamforming algorithms increase the on-axis signal acquired by the microphone while simultaneously increasing the side lobe rejection of the microphone, which in turn reduces background noise.
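The on-axis gain and off-axis rejection described above can be illustrated with a much simpler technique than adaptive beamforming: a fixed delay-and-sum beamformer steered straight ahead. The following is a minimal sketch only. The function names, the uniform linear array geometry, and the speed-of-sound constant are assumptions made for illustration, and nothing here represents the algorithms Microsoft planned to ship.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly room temperature


def array_response(num_mics, spacing_m, freq_hz, angle_deg):
    """Magnitude response of a fixed delay-and-sum linear array.

    The array is steered straight ahead (0 degrees). angle_deg is the
    arrival angle of the sound, measured from that forward axis.
    A value of 1.0 means the source is passed at full level.
    """
    # Each successive microphone sees the wavefront slightly later;
    # express that extra path length as a phase step at this frequency.
    phase_step = (2.0 * math.pi * freq_hz * spacing_m
                  * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S)
    total = sum(cmath.exp(1j * n * phase_step) for n in range(num_mics))
    return abs(total) / num_mics


def rejection_db(num_mics, spacing_m, freq_hz, angle_deg):
    """Attenuation in dB of an off-axis source relative to an on-axis one."""
    response = array_response(num_mics, spacing_m, freq_hz, angle_deg)
    return -20.0 * math.log10(max(response, 1e-12))


if __name__ == "__main__":
    # Hypothetical 4-microphone array with 5 cm spacing.
    for angle in (0, 30, 60, 90):
        print(angle, round(rejection_db(4, 0.05, 2000.0, angle), 1))
```

An adaptive beamformer goes further by adjusting the per-microphone weights at run time to place nulls on the noise sources, which this fixed sketch does not attempt.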
Background ambient noise is a leading cause of errors for speech recognition engines. De-Reverberation algorithms can further reduce or even cancel the effects of room acoustics, which would enable future scenarios for video teleconferencing applications. For example, users could have the acoustics of the other attendees overlaid or processed for more realism during a meeting.
Future Testing Guidelines. Areas to be addressed in future verification, testing, and recommendations include specifying the reference level for ambient noise (A-weighted, in decibels) and defining test methods to verify the end-to-end acoustical, mechanical, and electrical characteristics of the microphone system and any related subsystems.
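For readers unfamiliar with A-weighting, the sketch below shows how an A-weighted level could be computed from per-band sound levels using the standard analog A-weighting curve. The function names and the band-level input format are assumptions for illustration. The source does not specify a measurement procedure, so this is not a proposed test method.

```python
import math


def a_weighting_db(freq_hz):
    """A-weighting correction in dB at freq_hz (standard curve, about 0 dB at 1 kHz)."""
    f2 = freq_hz ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = ((f2 + 20.6 ** 2)
                   * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
                   * (f2 + 12194.0 ** 2))
    return 20.0 * math.log10(numerator / denominator) + 2.0


def a_weighted_level_db(band_levels):
    """Combine {center_freq_hz: level_db} band levels into one A-weighted level.

    Each band is corrected by the A-weighting curve, then the bands are
    summed as powers and converted back to decibels.
    """
    total_power = sum(10 ** ((level + a_weighting_db(freq)) / 10.0)
                      for freq, level in band_levels.items())
    return 10.0 * math.log10(total_power)


if __name__ == "__main__":
    # Hypothetical octave-band noise levels in dB SPL.
    bands = {125.0: 50.0, 250.0: 48.0, 500.0: 45.0, 1000.0: 42.0, 2000.0: 40.0}
    print(round(a_weighted_level_db(bands), 1))
```

Because the curve strongly attenuates low frequencies, low-frequency hum contributes far less to the A-weighted figure than its unweighted level suggests.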
The hardware requirements address issues related to microphone input compatibility for voice-input enabled applications such as acoustic echo cancellation, speech recognition, speakerphone telephony, and conferencing.
Current microphone guidelines (based on details in PC 2001 System Design Guide):
| • | Sensitivity: Input voltages of 10 mV-100 mV must deliver 0 dB FS |
| • | Electrets: 100 mV (0 dB FS) with minimum sensitivity of -44 dB relative to 1 V/pascal |
| • | Acoustic Echo Cancellation (AEC) mode assumes: |
Future Requirements for Microphones
The PC 2001 guidelines assume analog input. Real advances in this capability on the PC platform require digital I/O.
The Windows Logo Program requirements for hardware refer to the PC 2001 guidelines for complete definition of capabilities.
PC 2001 Guidelines for Microphones
The PC 2001 guidelines (co-authored by Microsoft Corporation and Intel Corporation) are as follows:
Audio subsystem supports AEC reference inputs
Built-in or external audio devices that support full-duplex, stereo playback and record for speakers and microphone must support simultaneous capture of microphone and one or more AEC reference inputs.
At minimum, the audio device must support capture of microphone input in the left channel and a monaural output mix in the right channel, where the monaural output mix is the left and right channels of the main output merged into a single channel. This 1+1 channel interleaved format (similar to stereo) will be referred to as "mic+ref", and can be easily achieved using existing stereo ADCs.
For more information, see Section 6.2 of Audio Codec '97, Revision 2.1, from Intel Corporation, which describes one possible implementation. This specification is listed in "Audio References."
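The "mic+ref" layout can be pictured as an ordinary interleaved stereo buffer in which the left slot carries the microphone and the right slot carries the mono reference. The sketch below is illustrative only: the helper names are hypothetical, samples are plain Python integers, and averaging the two output channels is an assumption, since the guideline says only that they are merged into a single channel.

```python
def mono_mix(left, right):
    """Merge the main output's left and right channels into one reference channel.

    Assumption: the merge is a simple average, which avoids clipping.
    """
    return [(l + r) // 2 for l, r in zip(left, right)]


def interleave_mic_ref(mic, ref):
    """Build a 1+1 channel interleaved buffer: mic in the left slot, ref in the right."""
    frames = []
    for mic_sample, ref_sample in zip(mic, ref):
        frames.extend((mic_sample, ref_sample))
    return frames


def split_mic_ref(frames):
    """Recover the microphone and AEC reference streams from a mic+ref buffer."""
    return frames[0::2], frames[1::2]


if __name__ == "__main__":
    mic = [1, 2, 3]
    ref = mono_mix([10, 20, 30], [30, 40, 50])
    buffer = interleave_mic_ref(mic, ref)
    print(buffer)
    print(split_mic_ref(buffer))
```

An echo canceller would consume the two recovered streams together, using the reference to estimate and subtract the loudspeaker signal picked up by the microphone.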
Analog microphone input meets PC 2001 jack and circuit requirements
This requirement enables users with electret or dynamic microphones to connect the device to their PC and achieve consistent results. This requirement also maintains compatibility with the installed base of microphones. For information about headset microphones, see AUD-0332, "If implemented, close-speaking headset microphone meets PC 2001 performance requirements."
If the PC has an analog microphone input, it must meet the following specifications:
| • | Three-conductor 1/8-inch (3.5 millimeters) tip/ring/sleeve microphone jack where the microphone signal is on the tip, bias is on the ring, and the sleeve is grounded. |
| • | Minimum AC input impedance between tip and ground: 4 kilohms. |
| • | Input voltages of 10-100 millivolts (mV) must deliver full-scale digital input, using software-programmable gain. |
| • | Maximum 5.5 V with no load, minimum 2.0 V with 0.8 milliampere (mA) load, direct current bias for electret microphones. |
| • | Minimum bias impedance between bias voltage source and ring: 2 kilohms. |
| • | AC coupled tip. |
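The bias figures above interact: the higher the bias impedance, the more voltage is lost under the 0.8 mA load. The following sketch checks a candidate design against the three bias limits. It models the bias circuit as an ideal voltage source in series with a single resistance, which is a simplifying assumption, and the function and constant names are hypothetical.

```python
MAX_OPEN_CIRCUIT_V = 5.5          # maximum bias voltage with no load
MIN_LOADED_V = 2.0                # minimum bias voltage under load
LOAD_CURRENT_A = 0.0008           # 0.8 mA test load
MIN_BIAS_IMPEDANCE_OHMS = 2000.0  # minimum impedance between source and ring


def loaded_bias_voltage(source_v, bias_impedance_ohms,
                        load_current_a=LOAD_CURRENT_A):
    """Voltage at the ring when the microphone draws load_current_a.

    Assumes an ideal source in series with one resistance, so the
    voltage lost is simply current times resistance (Ohm's law).
    """
    return source_v - load_current_a * bias_impedance_ohms


def bias_meets_requirements(source_v, bias_impedance_ohms):
    """Return True if the simplified bias circuit satisfies all three limits."""
    return (source_v <= MAX_OPEN_CIRCUIT_V
            and bias_impedance_ohms >= MIN_BIAS_IMPEDANCE_OHMS
            and loaded_bias_voltage(source_v, bias_impedance_ohms) >= MIN_LOADED_V)


if __name__ == "__main__":
    print(bias_meets_requirements(5.0, 2200.0))
    print(bias_meets_requirements(3.3, 2000.0))
```

Under this model, a 2 kilohm bias impedance drops 1.6 V at 0.8 mA, so the source must supply at least 3.6 V to stay above the 2.0 V minimum.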
Close-speaking headset microphone meets PC 2001 performance requirements
The following requirements are for close-speaking headset microphones intended for use in speech-recognition applications.
These requirements are compatible with most of the installed base of sound cards and audio-enabled system boards.
The requirements for a PC 2001 speech-recognition microphone are:
| • | Close-speaking headset design positions microphone within 1.5 inches of the corner of the speaker's mouth. |
| • | FSOV: 100 mV (0 dB FS). |
| • | Microphone connector meets requirements stated in the requirement AUD-0331, "If implemented, analog microphone input meets PC 2001 jack and circuit requirements." |
| • | Operating bias voltage from 2.0-5.0 volts direct current (VDC) with a maximum current drain of 0.8 mA. |
| • | Capable of sustaining a maximum voltage of 10 VDC on tip or ring without damage. |
| • | Frequency response: |
| • | Minimum sensitivity of -44 dB relative to 1 volt per pascal. |
| • | Maximum 2 percent THD+N 100 Hz to 10 kHz at 94 decibel sound pressure level (dBSPL). |
| • | Noise cancellation null sensitivity at 90 degrees and 270 degrees, ±10 degrees, with the following minimums: 20 dB at 100 Hz, 20 dB at 400 Hz, 20 dB at 1000 Hz, 20 dB at 4000 Hz, and 10 dB at 10 kHz. |
| • | Maximum wind noise sensitivity of -65 dB with 0 dB = 1 V (measured with wind speed of 1 meter per second at the 0 degree axis of microphone). |
| • | Maximum output impedance of 1 kilohm (using a 1-kHz full-scale test tone with 2.0 VDC bias). |
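To put the sensitivity and full-scale figures in context, the arithmetic below converts the -44 dB sensitivity into millivolts per pascal and estimates how far a 94 dBSPL tone sits below the 100 mV full-scale level. This is a worked example under the usual acoustic reference of 20 micropascals for 0 dBSPL. The function names are hypothetical.

```python
import math

REFERENCE_PRESSURE_PA = 20e-6  # standard reference pressure for 0 dBSPL


def sensitivity_mv_per_pa(sensitivity_db_re_1v_pa):
    """Convert sensitivity in dB relative to 1 V/Pa into millivolts per pascal."""
    return 1000.0 * 10 ** (sensitivity_db_re_1v_pa / 20.0)


def spl_to_pascals(spl_db):
    """Convert a sound pressure level in dBSPL to pascals."""
    return REFERENCE_PRESSURE_PA * 10 ** (spl_db / 20.0)


def output_mv(sensitivity_db, spl_db):
    """Microphone output in millivolts for a tone at spl_db."""
    return sensitivity_mv_per_pa(sensitivity_db) * spl_to_pascals(spl_db)


def headroom_to_full_scale_db(output_mv_value, full_scale_mv=100.0):
    """How many dB the given output sits below the full-scale voltage."""
    return 20.0 * math.log10(full_scale_mv / output_mv_value)


if __name__ == "__main__":
    level_mv = output_mv(-44.0, 94.0)
    print(round(level_mv, 2), "mV at 94 dBSPL")
    print(round(headroom_to_full_scale_db(level_mv), 1), "dB below full scale")
```

A microphone at the minimum sensitivity therefore produces only a few millivolts at 94 dBSPL, which is why the analog input requirement calls for software-programmable gain.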
The complete text of PC 2001 System Design Guide is available at http://www.microsoft.com/whdc/system/platform/pcdesign/desguide/pcguides.mspx.
More advances are required in the area of natural-speaking command and control, so that vendors can bring to market new products that are speech enabled.
With SAPI5 (Speech API version 5), it is easy to enable an application for speech. It takes about 300 lines of code to enable an existing application to use SAPI for dictation, command and control, and text-to-speech (TTS). The application developer needs only to write to the Speech API, and any speech engine can then be plugged in without modification. There is no need to know the specifics of interfacing with each engine, as was true in earlier implementations.
However, the hard part is building a grammar for command and control so that the user can say, for example, "I want to send a letter to my Mom" and have an application launch the appropriate template. Today, the user must say, "File...New...Letters and Faxes." There is great opportunity in this field for the vendor who wants to invest in determining how to handle the different variations of a phrase and build an appropriate grammar.
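The grammar problem can be illustrated with a toy phrase matcher that maps several wordings onto one command. The sketch below uses plain regular expressions on already-recognized text. It does not use SAPI or its grammar format, and the command name and patterns are hypothetical.

```python
import re

# Hypothetical command table: each action lists the phrase patterns
# that should trigger it, covering both natural and menu-style wording.
COMMAND_PATTERNS = {
    "new_letter": [
        r"\b(send|write|compose|start)\b.*\b(letter|note)\b",
        r"\bfile\b.*\bnew\b.*\bletters? and faxes\b",
    ],
}


def match_command(utterance):
    """Return the action whose pattern matches the utterance, or None."""
    text = utterance.lower()
    for action, patterns in COMMAND_PATTERNS.items():
        for pattern in patterns:
            if re.search(pattern, text):
                return action
    return None


if __name__ == "__main__":
    print(match_command("I want to send a letter to my Mom"))
    print(match_command("File...New...Letters and Faxes."))
    print(match_command("Open the window"))
```

Even this small table shows why the work is hard: every additional way of phrasing a request needs its own pattern, and loose patterns risk matching utterances that were never meant as commands.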
| • | Audio hardware and driver design notes for Windows |
| • | Windows Logo Program requirements for hardware |
| • | Microsoft Speech SDK 5.1 |
| • | Microsoft Speech Technologies |
| • | Microsoft Research on Speech Technologies |