January 1996

Microsoft Systems Journal Homepage

Talk to Your Computer and Have It Answer Back with the Microsoft Speech API

Mike Rozak

Mike Rozak is a development lead in the Personal Systems Division of Microsoft, tasked with the incorporation of speech into the operating system's APIs. Mike has been working with engine vendors to design and implement a speech recognition and text-to-speech API.

In the beginning, humans communicated with their computers using soldering irons and voltmeters. Needless to say, this grew tiresome quickly. So someone had the bright idea of using toggle switches and light bulbs. But that wasn't so hot either, so soon scientists figured out a way to feed their computers instructions on little cards with holes in them; the computers spat their own cards out the other end. Still pretty awkward. Things started really cooking when keyboards and monitors came along. Now people were communicating in a strange dialect with words like mv and grep. The enter key meant "do it." And if you wanted to read the results in the bathroom, you could print them on a big old clunky line printer.

These days, the march to make computers communicate in ways that come naturally to humans continues. In the quest for a perfectly transparent user interface, speech is perhaps the final frontier, short of a direct brain link.

Admit it, since you were a kid you wanted to talk to a computer the way Mr. Spock talks to his computer aboard the Enterprise.

"Computer, what time is it?"

"Ten fifteen."

"Shoot, I'm late for my pon farr. Hey, print off the latest science officer data for me, wouldja?"

"Sure thing, Spock. You want that by animal or mineral?"

If that sounds far-fetched, keep reading. In this article, I'll bring you up to speed on what's happening with computer speech, and I'll show you how to write a simple talking clock program that speaks the current time of day whenever you ask, "What time is it?" Really!

Wherefore Speech?

I know what you're thinking. "Why would I want to talk to my computer?" "Why would I want my computer talking to me?" You imagine a cacophony of computers and people gabbing away in their cubicles. You think how silly you'd feel sitting in your home office, talking to a beige box on your desk.

Well, it's true that keyboards and mice are in little danger of becoming obsolete any time soon, but there are nevertheless many situations where speech is useful. Have you ever played a computer game where a character asks you a question? A cartoon-style text balloon pops out of the character's mouth and you answer by clicking a button. Wouldn't it be more natural if the character really spoke? And for you to answer back in English? Or French, if that's your preference.

Or how about this. Your screen is littered with toolbars, and you can't remember whether it's Ctrl-Alt-F8 to double-underline, or Alt-Shift-F4 or Alt-Control-Shift-Mumble-Whatever. Why not just select some text and say, "Double underline this"? You wouldn't have to shout; you could say it softly.

Or maybe you'd like to call your bank and transfer some money from savings to checking. Instead of playing twenty questions with the synthetic phone operator as you maneuver through seven levels of prerecorded menus, why not just say, "Transfer one hundred dollars from savings to checking"? Of course, in that case, you'd be talking to the bank's computer, not yours, but you might want to call your own PC to ask, "Do I have any email?" or "Look up Mary Smith's number in my address book." No fussing with buttons while you're driving in traffic; no need for a laptop or modem. Just dial up and talk!

Or, if you're one of the millions of people who suffer from repetitive motion injuries like carpal tunnel syndrome, why not give your fingers a break once in a while? Don't type, dictate. There are other "hands off" situations where people need to use their computers while doing something else, like operating a piece of machinery. Or maybe you just want your computer to read words or numbers back to you as you type them, to help catch typing errors. These are just a few areas where computer speech is really useful.

How Do They Do That?

You don't need to understand the intricacies of speech technology to use it in your apps, but I suspect many of you are curious, so I figured I'd give you a very short overview of how it works.

There are two basic technologies: speech recognition (SR) and speech synthesis, depending on who is doing the talking: you or the computer. Speech synthesis is commonly called "text-to-speech" or TTS, since the speech is usually synthesized from text data. Figure 1 shows the architecture of a typical text-to-speech engine.

Figure 1 Text-to-Speech Engine

The process begins when the application hands the engine a string of text such as, "The man walked down 56th St." The text analysis module converts numbers into words, identifies punctuation such as commas, periods, and semicolons, converts abbreviations to words, and even figures out how to pronounce acronyms. Some acronyms are spelled out (MSJ) whereas others are pronounced as a word (FEMA). The sample sentence would get converted to something like

The man walked down fifty sixth street

Text analysis is quite complex because written language can be so ambiguous. A human has no trouble pronouncing "St. John St." as "Saint John Street," but a computer, in typically mechanical fashion, might come up with "Street John Street" unless a clever programmer gives it some help.
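
To make the "St." problem concrete, here is a minimal C++ sketch of one disambiguation rule. It is not how any particular engine does it; the function name ExpandSt and the rule itself (read "St." as "Saint" when a capitalized word follows, otherwise as "Street") are assumptions made up for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Expand the ambiguous abbreviation "St." using one simple, hypothetical rule:
// it reads as "Saint" when a capitalized word follows, and as "Street" otherwise.
std::string ExpandSt(const std::string& text)
{
    // Split the input into whitespace-separated tokens.
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string tok; in >> tok; )
        tokens.push_back(tok);

    std::string out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        std::string word = tokens[i];
        if (word == "St.") {
            // Look one token ahead to guess which reading is meant.
            bool nameFollows = i + 1 < tokens.size() &&
                std::isupper(static_cast<unsigned char>(tokens[i + 1][0]));
            word = nameFollows ? "Saint" : "Street";
        }
        if (!out.empty())
            out += ' ';
        out += word;
    }
    return out;
}
```

With this rule, ExpandSt("St. John St.") yields "Saint John Street". The rule is easy to fool (a sentence that ends in "St." followed by another capitalized sentence, for instance), which is why real text analysis needs a lot more context than one word of lookahead.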

Once the text is converted to words, the engine figures out what words should be emphasized by making them louder or longer, or giving them a higher pitch. Other words may be deemphasized. Without word emphasis, or "prosody," the result is a monotone voice that sounds robotic, like something out of a '50s sci-fi flick. After adding prosody, the sample sentence might end up like this:

<de-emphasize>the <emphasize>man walked 
<emphasize>down fifty <emphasize>sixth street<pause>.

Next, the text-to-speech engine determines how the words are pronounced, either by looking them up in a pronunciation dictionary, or by running an algorithm that guesses the pronunciation. Some text strings have ambiguous pronunciations, such as "read." The engine must use context to disambiguate the pronunciations. The result of this analysis is the original sentence expressed as phonemes: "Th-uh M-A-N w-au-l-k-t D-OU-N f-ih-f-t-ee S-IH-K-S-TH s-t-r-ee-t".

Next, the phonemes are parsed and their pronunciations retrieved from a phoneme-to-sound database that numerically describes what the individual phonemes sound like. If speech were simple, this table would have only forty-four entries, one for each of the forty-four English phonemes (or whatever language is used). In practice, each phoneme is modified slightly by its neighbors, so the table often has as many as 1600 or more entries. Depending on the implementation, the table might store either a short wave recording or parameters that describe the mouth and tongue shape. Either way the sound database values are finally smoothed together using signal processing techniques, and the digital audio signal is sent to an output device such as a PC sound card and out the speakers to your ears.

That's text-to-speech. Speech recognition is the flip side. Figure 2 shows a generic speech recognition engine. When the user speaks, the sound waves are converted into digital audio by the computer's sound card. Typically, the audio is sampled at 11 kHz and 16 bits. The raw audio is first converted by the frequency analysis module to a more useful format. This involves a lot of digital signal processing that's too complicated to describe here. The basic challenge is to extract the meaningful sound information from the raw audio data. If you were to say the word "foo," then say "foo" again, and look at the waveforms generated, they would look kind of similar, but comparing the raw waveforms directly won't consistently recognize them as the same sound; for that you need some pretty hairy mathematical techniques using Fourier transforms. Fortunately, people have already figured this stuff out.
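
A small sketch can show why the frequency domain helps. The C++ below is a naive discrete Fourier transform, not the signal processing a real engine uses, and BinMagnitude and MakeTone are hypothetical helper names. The idea: two tones with the same pitch but a different phase have different raw samples, yet the same magnitude at the matching frequency bin.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Magnitude of one frequency bin of a naive discrete Fourier transform.
// `bin` counts whole cycles across the buffer.
double BinMagnitude(const std::vector<double>& samples, std::size_t bin)
{
    assert(!samples.empty());
    const double kPi = 3.14159265358979323846;
    const double n = static_cast<double>(samples.size());
    double re = 0.0, im = 0.0;
    for (std::size_t t = 0; t < samples.size(); ++t) {
        const double angle = 2.0 * kPi * bin * t / n;
        re += samples[t] * std::cos(angle);
        im -= samples[t] * std::sin(angle);
    }
    return std::sqrt(re * re + im * im);
}

// Build a test tone: `cycles` whole periods of a unit-amplitude sine wave,
// shifted by `phase` radians.
std::vector<double> MakeTone(std::size_t length, std::size_t cycles, double phase)
{
    const double kPi = 3.14159265358979323846;
    std::vector<double> tone(length);
    for (std::size_t t = 0; t < length; ++t)
        tone[t] = std::sin(2.0 * kPi * cycles * t / length + phase);
    return tone;
}
```

If you build MakeTone(64, 4, 0.0) and MakeTone(64, 4, 1.0), the sample values differ at almost every position, but BinMagnitude at bin 4 comes out the same for both. Real speech is far messier than a pure tone, so treat this only as a picture of the principle.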

Figure 2 Speech Recognition Engine

The converted audio is next broken into phonemes by a phoneme recognition module. This module searches a sound-to-phoneme database for the phoneme that most closely matches the sound it heard. Each database entry contains a template that describes what a particular phoneme sounds like. As with text-to-speech, the table typically has several thousand entries. While the phoneme table could in theory be the same as that used for TTS, in practice they are different because the SR and TTS engines usually come from different vendors.

Because comparing the audio data against several thousand phonemes takes a long time, the speech recognition engine contains a phoneme prediction module that reduces the number of candidates by predicting which phonemes are likely to occur in a particular context. For example, some phonemes rarely occur at the beginning of a word, such as the "ft" sound at the end of the word "raft." Other phonemes never occur in pairs. In English, an "f" sound never occurs before an "s" sound. But even with these optimizations, speech recognition still takes too long.

A word prediction database is used to further reduce the phoneme candidate list by eliminating phonemes that don't produce valid words. After hearing, "y eh," the recognizer will listen for "s" and "n" since "yes" and "yen" are valid words. It will also listen for "m" in case you say "Yemen." It will not listen for "k" since "yek" is not a valid word. (Except in baby-talk, which is not currently supported.) The candidate list can be reduced even further if the application stipulates that it only expects certain words. If the app only wants to know if the user said "yes" or "no," the phoneme recognizer needn't listen for "n" following "y eh," even though "yen" is a word. This final stage reduces computation immensely and makes speech recognition feasible on a 33MHz 486 or equivalent PC. Once the phonemes are recognized, they are parsed into words, converted to text strings, and passed to the application.
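
The pruning step is easy to picture in code. Here is a minimal C++ sketch, assuming each word is stored as a list of phoneme labels; the function name NextCandidates and the informal phoneme spellings are made up for illustration and are not part of any engine or of the Speech API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

typedef std::vector<std::string> Phonemes;   // one word as a phoneme sequence

// Return the phonemes that can legally follow `heard`, given the words
// the recognizer is currently listening for.
std::set<std::string> NextCandidates(const std::vector<Phonemes>& vocabulary,
                                     const Phonemes& heard)
{
    std::set<std::string> next;
    for (const Phonemes& word : vocabulary) {
        if (word.size() <= heard.size())
            continue;                        // word is too short to continue
        bool prefixMatches = true;
        for (std::size_t i = 0; i < heard.size(); ++i) {
            if (word[i] != heard[i]) { prefixMatches = false; break; }
        }
        if (prefixMatches)
            next.insert(word[heard.size()]); // the phoneme right after the prefix
    }
    return next;
}
```

With a vocabulary holding "yes," "yen," and "Yemen," asking what can follow "y eh" returns "s," "n," and "m." Shrink the vocabulary to just "yes" and "no" and the only candidate left is "s," which is the saving the application gets by telling the engine what it expects.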

As you might imagine, both text-to-speech and speech recognition involve quite a bit of processing, but speech recognition is harder because it usually requires more processing for equivalent user satisfaction. A few years ago, you needed a high-end workstation to do speech recognition. Today, just about every new PC and even many older PCs can handle speech. While the exact requirements vary from one speech engine to another, Figure 3 gives you a rough idea of the hardware needed to run various kinds of speech applications under Windows. The faster the CPU and the more memory available, the higher the accuracy for speech recognition and the better the text-to-speech sounds.

Of course, you also need a sound card, microphone, and speakers. Most speech engines will work with any sound card. Some systems offload processing onto a DSP (digital signal processor) chip that comes on some high-end sound cards, which cuts the CPU speed requirement in half. Better microphones and speakers will also improve things.

As speech has become more feasible on average PCs, vendors have been busy developing and promoting their speech engines. Many multimedia PCs and sound cards come bundled with speech software. Other vendors sell their engines as standalone products. Some apps even come bundled with speech engines.

Unfortunately, as with any budding technology, the situation is a bit chaotic. Even though they all support similar functionality, each speech engine has its own specific features and proprietary API. If you want to use speech in your app, you've first got to pick which engine to use, and write your program for that engine. If a better engine comes along, you're out of luck. You'll probably have to rewrite your program substantially to use the other API. Proprietary APIs have stifled the widespread adoption of speech. When faced with an irrevocable decision about which engine to use, many developers choose not to implement speech at all.

The Microsoft Speech API

The Microsoft® Speech API is an attempt to correct this problem. By promoting an industry-standard programming interface for speech, Microsoft hopes to encourage developers to write speech-enabled apps. But I'm not here to spout business strategies; I'm here to tell you about the API!

The Speech API lets you write Win32®-based apps (for Windows® 95 or Windows NT™) that use speech recognition and text-to-speech. The API is specified as a collection of OLE Component Object Model (COM) objects. Using OLE makes speech readily available to developers writing in Visual Basic®, C/C++, or any other programming language that can access OLE objects directly or through automation. The Speech API requires Windows 95 or Windows NT 3.51, and since the API itself performs no recognition or synthesis, you still need third-party speech engines: one for SR and one for TTS.

As with other Windows Open Services Architecture (WOSA) services, the Speech API is intended as a standard interface that application developers and engine vendors alike can code to. Programmers can write apps without worrying about which engine to use, engine vendors can get instant compatibility with all speech apps, and users gain the freedom to choose whichever speech engine meets their budget and performance requirements. The situation is analogous to GDI, which lets programs draw graphics without worrying about what kind of display card or monitor the user has. Just like GDI, the Speech API provides escape hooks to access proprietary engine features when you need to do something special.

The Speech API offers two levels of access: high-level objects designed to make implementation easy, and low-level objects that offer total control but make you do a little more work. If all your program does is listen for a few voice commands and utter some simple phrases, you can use the high-level objects. To do more sophisticated stuff, you need the low-level.

The high-level objects, provided by Microsoft, don't do any SR or TTS themselves; they just call the low-level objects to do the work. The low-level objects are provided by the speech engine vendor, just like the video and sound card drivers that come with your display or sound card. When your app uses the low-level API, it's talking directly to the third-party code, bypassing Microsoft code completely (see Figure 4). The low-level API is too complex to describe here, so I'll focus on the high-level objects and just give you a quick overview of the low-level stuff.

Figure 4 Using the Low-level Speech API

Whichever you use, you'll be dealing with OLE objects. Figures 5 and 6 show the main OLE objects and interfaces that constitute the Speech API. Don't worry, you'll probably never need to use most of the objects shown in Figure 6. The objects you're most likely to use are voice commands for speech recognition and voice text for text-to-speech. Microsoft also provides a speech recognition sharing object that lets several apps share engines.

Voice Commands and the Talking Clock

To show just how easy it is to write apps that talk and listen, I wrote a talking clock program (see Figure 7) that speaks the time and/or date whenever you ask "What time is it?" or "What day is it?" Clock will probably seem like a ghost of an app to you: it has no menu, and in fact it doesn't even have a window! There's no need for either, since all it does is talk in response to verbal commands. Of course, most speech apps will still have menus and windows and generally look like normal apps. Clock merely demonstrates that they don't have to.

Figure 7 Voice Commands and Menus

Clock uses the high-level SR Voice Commands object to listen for commands from the user. The main interface, IVoiceCmd, provides functions to do simple "command and control" speech recognition. Users can issue simple commands like "Open the file" and answer simple yes/no questions. For more sophisticated kinds of speech recognition such as dictation, you'd have to use the low-level API.

Voice Commands work a lot like traditional Windows menus. You first create a voice menu of commands you want to listen for, then you listen for them. Pretty simple. Most programs will have one voice menu for the main window, and one for every dialog box. When the SR engine hears a command, it notifies the appropriate (active) app. The Voice Commands module actually includes a few different objects. The main one is the Voice Commands object, which provides basic functions to turn speech recognition on or off and create voice menu objects.

The first thing Clock does is initialize OLE by calling CoInitialize. (If you're using MFC, all you have to do is check the "Container" or "Both container and server" check boxes when AppWizard asks what kind of OLE support you want; AppWizard generates a call to AfxOleInit in your app's InitInstance function.) Once OLE is initialized, Clock creates a Voice Commands object.

 CoCreateInstance(CLSID_VCmd, NULL,
                  CLSCTX_LOCAL_SERVER, IID_IVoiceCmd,
                  (LPVOID *)&gpIVoiceCommand);

CoCreateInstance creates a local instance of the Voice Commands object. CLSID_VCmd is the class ID. CLSCTX_LOCAL_SERVER indicates that the object should be created on the local machine, but in a different process from the app. The active application, such as a word processor, can have a voice menu listening while Clock's menu is listening too. If the user says, "Print the document," the command goes to the word processor; if the user asks, "What time is it?" Clock gets the command. IID_IVoiceCmd is the interface ID for the Voice Commands interface and gpIVoiceCommand is a pointer to this interface that's filled in by CoCreateInstance. All the symbols you need are defined in SPEECH.H.

To actually create the object, CoCreateInstance fires up WINDOWS\SPEECH\VCMD.EXE (if it's not already running). Some other DLLs are used too: VCMSHL.DLL contains marshaling code, and SPEECH.DLL contains some objects for the low-level API. Each engine also has its own DLLs. But as far as the app and you are concerned, everything is handled by OLE. You don't have to worry about what files are loaded; it all happens automagically.

Before you can create a menu, you must register a notification sink.

 gpIVoiceCommand->Register(
   "",                    // default wave-in device
   &gVCmdNotifySink,      // notification sink
   IID_IVCmdNotifySink,   // interface ID
   0,                     // high priority notifications
   NULL);                 // VCSITEINFO

The empty string tells the Voice Commands object to listen to the default wave-in device, normally the microphone. Alternatively, you can pass a string like "Line1" to listen for commands over phone line number one. The string refers to a system registry entry that identifies the SR engine and wave device to use. gVCmdNotifySink is the notification sink (which I'll describe shortly), and IID_IVCmdNotifySink is the interface ID, which identifies what kind of sink it is. Currently, IVCmdNotifySink is the only one, but in the future others may be supported. The 0 tells Voice Commands to send Clock only the most important notifications. Voice Commands can notify apps when the user is talking too loud, but Clock doesn't care about that.

Once you've registered a notification sink, you can create a voice menu. The system supports multiple voice menus that can be independently activated (listening) or deactivated (not listening). A CAD program might have one voice menu with commands such as "Save the file" that are always active, and another voice menu with commands like "Rotate 90 degrees" that are only active when something is selected. Unlike normal Windows menus, several voice menus can be active at the same time. You can even make a menu global, so it's still listening when your app doesn't have the focus. Clock does this so the user can ask "What time is it?" while working in any app.

To create a menu, you set up a couple of structures that give the menu a name and select a language. The API supports all languages, but the user can obviously only use the languages actually installed on the machine. Most speech engines support English, German, French, Japanese, Spanish, Italian, and a few others. Because it's so expensive to produce a language, many less common languages are not yet supported by any engine, though I'm sure that somewhere, someone with nothing better to do is at this very moment working on one for Klingon. To create the menu, you call IVoiceCmd's MenuCreate function.

 gpIVoiceCommand->MenuCreate(
   &VCmdName,                  // menu name
   &Language,                  // language
   VCMDMC_CREATE_TEMP,         // don't archive
   &gpIVCmdMenu);              // ptr to menu (returned)

VCmdName and Language identify the menu name and language; VCMDMC_CREATE_TEMP tells the API to create a temporary menu, which will not have its contents archived to disk. You can create permanent menus that are saved in a database so that load times are faster, but Clock doesn't. gpIVCmdMenu is filled with a pointer to the IVCmdMenu interface for the new menu object. The menu starts out empty. IVCmdMenu has methods that add, remove, and modify voice commands. For Clock, I wrote a wrapper function, AddCommand, that bundles its arguments into a structure and passes it to IVCmdMenu::Add.

           "What time is it?",
           "What day is it?",
           "Stop running Talking Clock.",

I added the commands one at a time, but you can add hundreds of commands in a block if you want. Note how the commands are given as ordinary ASCII strings; you don't have to mess with phonetic representations or anything like that. The IDC_XXX constants identify the commands, similar to normal menu IDs. The API imposes no limit on the number or size of commands, but accuracy and performance will degrade if you add more than a few hundred. To actually start listening, Clock activates the menu:

 gpIVCmdMenu->Activate(NULL, 0);

I pass NULL for the window handle to make the menu global, so Clock listens all the time, even when another app has focus. The chances are pretty good that no other app will be listening for any of the three commands in Clock, but if one does, the system is smart enough to notify it rather than Clock. Assuming this is not the case, when the SR engine hears "What time is it?" (or either of the other two commands), it notifies Clock through Clock's notification sink.
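
The routing between a focused app's menu and a global menu can be sketched in a few lines of portable C++. This is only a model of the behavior described here, not the Voice Commands implementation: VoiceMenu, Listens, and RouteCommand are hypothetical names, and the "focused app first, global menus second" order is my reading of the text rather than a documented algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct VoiceMenu {
    std::string owner;                  // app that created the menu
    bool global;                        // true: keeps listening without focus
    std::vector<std::string> commands;  // phrases this menu listens for
};

// True if `menu` is listening for `phrase`.
static bool Listens(const VoiceMenu& menu, const std::string& phrase)
{
    return std::find(menu.commands.begin(), menu.commands.end(), phrase)
           != menu.commands.end();
}

// Decide which app gets notified. Returns the owner's name, or an empty
// string when no listening menu contains the phrase.
std::string RouteCommand(const std::vector<VoiceMenu>& menus,
                         const std::string& focusedApp,
                         const std::string& phrase)
{
    // First choice: any menu that belongs to the app with focus.
    for (const VoiceMenu& menu : menus)
        if (menu.owner == focusedApp && Listens(menu, phrase))
            return menu.owner;
    // Fallback: a global menu, such as Clock's.
    for (const VoiceMenu& menu : menus)
        if (menu.global && Listens(menu, phrase))
            return menu.owner;
    return std::string();
}
```

Given a word processor menu with "Print the document" and a global Clock menu with "What time is it?", the first phrase routes to the word processor and the second to Clock, whichever app has focus. If the focused app also listens for "What time is it?", it wins over Clock.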

In OLE, a notification sink is just a callback object that some object uses to notify your app when something happens (see Figure 7). Each sink has its own interface of notification functions. Clock implements an object, CIVCmdNotifySink, that has the IVCmdNotifySink interface (see Figure 8). The only notification that Clock cares about is CommandRecognize; all the other functions have empty implementations. When the SR engine hears "what time is it?" it calls CIVCmdNotifySink::CommandRecognize.

CIVCmdNotifySink::CommandRecognize(DWORD dwID,...)
  switch (dwID) {
    // Speak the time (described later)...
      // Speak the date (described later)...
      DestroyWindow (ghWndMain);
  return NOERROR;

CommandRecognize has a lot of arguments, most of which Clock doesn't use. The only important one is the command ID, dwID. As with a WM_COMMAND message, you do a switch on the ID. If your app has a normal Windows menu with the same actions, you should use the same IDs. In fact, you could even pass the notifications to your main window as a WM_COMMAND message.

CIVCmdNotifySink::CommandRecognize(DWORD dwID,...)
   SendMessage(ghWndMain, WM_COMMAND, dwID, 0);
   return NOERROR;

If you're using MFC, you'd send the message to AfxGetApp()->m_pMainWnd instead of ghWndMain, or perhaps you'd store a pointer to the main window in your CIVCmdNotifySink. Of course, as with all OLE objects, you've got to release them when you're finished.

 // Release menu
 if (gpIVCmdMenu)
    gpIVCmdMenu->Release();
 gpIVCmdMenu = NULL;
 // Release Voice Commands object
 if (gpIVoiceCommand)
    gpIVoiceCommand->Release();
 gpIVoiceCommand = NULL;
 // Terminate OLE
 CoUninitialize();

This sequence appears in Clock's ShutDown function, called at the end of WinMain as Clock is terminating. In MFC, you could release the objects in your main window's OnDestroy handler or in your app's ExitInstance function. With MFC, you don't have to terminate OLE; it takes care of that for you.

That's it! Clock now recognizes your voice! Of course, it doesn't actually do anything since I haven't added the text-to-speech part yet.
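
To recap the callback pattern from this section without any OLE plumbing, here is a minimal, portable C++ sketch. It is a stand-in, not the real IVCmdNotifySink: the class names, the UtteranceBegin and UtteranceEnd placeholders, the FakeRecognizerHears driver, and the command IDs are all hypothetical (Clock's actual IDC_ constants aren't shown in this article).

```cpp
#include <cassert>
#include <string>

// Hypothetical command IDs, standing in for an app's menu-style constants.
enum CommandId { kCmdTime = 1, kCmdDate = 2, kCmdQuit = 3 };

// A plain C++ stand-in for a notification sink interface: the recognizer
// holds a reference to this and calls back when something happens.
class CommandSink {
public:
    virtual ~CommandSink() {}
    virtual void CommandRecognize(unsigned long id) = 0;
    virtual void UtteranceBegin() {}   // notifications the app may ignore
    virtual void UtteranceEnd() {}
};

// The app's sink overrides only the notification it cares about.
class ClockSink : public CommandSink {
public:
    std::string lastAction;            // what the app decided to do
    void CommandRecognize(unsigned long id) override {
        switch (id) {
        case kCmdTime: lastAction = "speak time"; break;
        case kCmdDate: lastAction = "speak date"; break;
        case kCmdQuit: lastAction = "quit";       break;
        default:       lastAction = "ignored";    break;
        }
    }
};

// Stand-in for the recognizer side: it knows only the abstract interface.
void FakeRecognizerHears(CommandSink& sink, unsigned long id)
{
    sink.UtteranceBegin();
    sink.CommandRecognize(id);
    sink.UtteranceEnd();
}
```

The point of the design is that the caller never needs to know what the app does with a command. It sees only the interface, and the app fills in the one function it cares about while the rest stay empty.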

Voice Text

To make Clock talk, I need voice text, the high-level object for text-to-speech (see Figure 9). The voice text module has only one object, the voice text object. Using it is pretty straightforward.

 CoInitialize(NULL); // if you haven't done it already
                 (LPVOID *)&gpIVTxt);

Figure 9 Voice Text

It's pretty much the same as creating a Voice Commands object; only the IDs are different. As with Voice Commands, you must register a sink to receive notifications:

 gpIVTxt->Register("",  // default wave device
                  gszAppName,          //app name
                  NULL,                //notify sink
                  IID_IVTxtNotifySink, //notifysinkIID
                  NULL,                // flags
                  NULL );              // VTSITEINFO*

The empty string selects the default wave-out device, normally the sound card. You could use "Line1" or some other audio output device defined in the registry. Voice text calls IVTxtNotifySink whenever something happens; for example, when the TTS engine starts or stops talking, or when someone (the user or another app) changes global attributes such as the voice's volume or pitch. Clock doesn't care about any of that, nor does it even register IVTxtNotifySink, so it passes NULL for the notification sink. But even if your sink is NULL, you still have to register because voice text needs your app's name. That's all the setup you need; when it's time to talk, just get the time and call IVoiceText::Speak.

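 TCHAR szTemp[128];
 SYSTEMTIME st;
 lstrcpy (szTemp, TEXT("The time is "));
 GetLocalTime (&st);
 // Append the formatted time after the prefix
 GetTimeFormat (0, TIME_NOSECONDS, &st, NULL,
                szTemp + lstrlen(szTemp), 128 - lstrlen(szTemp));
 gpIVTxt->Speak( szTemp, VTXTSP_NORMAL, NULL );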

The call to Speak happens asynchronously. That is, control returns immediately; your app doesn't wait for the computer to finish speaking. (But when it does finish, it can notify you through IVTxtNotifySink.) Like other Win32 API functions, IVoiceText::Speak accepts ANSI or Unicode, as determined by the compile-time #define symbol UNICODE.

That's it. Clock now talks! If you don't believe me, grab the code (from the usual MSJ sources) and run it yourself! Of course, you need a sound card, speakers, a microphone, and the Speech API. I'll tell you how to get the API at the end of the article.

Clock doesn't provide any way for the user to select or change the sound of the computer voice. That's because it doesn't need to. Voice text uses whatever the user has selected as the system default. Most people want their computers to always speak with the same voice. The voice quality (male/female, the pitch, and so on) is specified through a Control Panel applet called Microsoft Voice, installed as part of the Microsoft Voice setup (for more, see the sidebar). You can change the voice programmatically if you like (games may even need several voices), but you need the low-level API for that. My advice is to avoid it for most apps. You don't want to annoy users who have taken the trouble to select their ideal cybervoice. They might think their computer is possessed.

Low-Level Grunge

Voice Commands and voice text objects expose enough functionality to implement moderately sophisticated speech apps. Clock uses only a few of the many functions and features available through the high-level API. Still, there are times when you need to do something more sophisticated, like take dictation or use multiple voices. For that, you need the low-level API, which lets you talk directly to the speech engine. There's not enough room here to describe it in detail, but I can give you some idea of the sorts of things you can do with the low-level objects.

Imagine that you're writing a transcription program that translates an audio recording of a meeting or telephone call into text. Such a program would need to use the low-level objects to perform dictation and to "listen" to a wave file instead of the microphone. Here's a quick walkthrough that explains how it might work (see Figure 10).

Figure 10 Low-level SR Objects with Custom Audio Source

The app first determines where the audio should come from and creates an audio source object through which to acquire digital audio data. Microsoft supplies an audio source object that gets its audio from the multimedia wave-in device (usually the microphone), but you can write your own so that your app can get audio from wave files or specialized hardware devices. The transcription app implements a custom audio source to get audio from a wave file. This audio source object would probably have a custom interface with functions like Open and Close that let the app select which file to use.

The app would create an SR engine enumerator object (not shown in Figure 6, but provided by Microsoft), and use it to find the SR engine it wants to use. You can search for engines that support specific languages or features, the same way you might look for a font with serifs. For example, the transcription program might look for an SR engine that supports context-free grammars. (I'll explain what that is in a moment.) Once it finds the right engine, it creates an instance of it and passes it the audio source object.

The SR engine object has a dialogue with the audio source object to find a common audio format, for example, 16-bit 11 kHz pulse code modulation (PCM). Your custom audio source would read the format from the wave file to check that the file is in the right format. Assuming it is, the engine registers an audio source notification sink with the audio source object. Now the audio source object submits digital audio data to the engine through the notification sink. All this happens invisibly to the app, which only has to set things up.
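
The format dialogue boils down to finding a format both sides can handle. Here is a minimal C++ sketch of that idea, assuming the engine lists its formats in order of preference and the audio source lists what it can deliver. WaveFormat, SameFormat, and NegotiateFormat are hypothetical names, and the real low-level interfaces negotiate differently in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct WaveFormat {
    int sampleRateHz;
    int bitsPerSample;
    int channels;
};

static bool SameFormat(const WaveFormat& a, const WaveFormat& b)
{
    return a.sampleRateHz == b.sampleRateHz &&
           a.bitsPerSample == b.bitsPerSample &&
           a.channels == b.channels;
}

// Walk the engine's formats in order of preference and return the index of
// the first one the audio source can also deliver, or -1 if none match.
int NegotiateFormat(const std::vector<WaveFormat>& enginePrefers,
                    const std::vector<WaveFormat>& sourceOffers)
{
    for (std::size_t i = 0; i < enginePrefers.size(); ++i)
        for (std::size_t j = 0; j < sourceOffers.size(); ++j)
            if (SameFormat(enginePrefers[i], sourceOffers[j]))
                return static_cast<int>(i);
    return -1;
}
```

A wave file has exactly one format, so a file-backed audio source would offer a single entry. If the engine doesn't accept it, the result is -1 and the app has to convert the file or pick another engine.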

The app next registers a main notification sink that receives grammar-independent notifications such as whether or not the user is speaking, or is speaking too loudly. You could use this information to tell the user to speak softly. The transcription program would use it to figure out when the user starts or stops speaking.

Next, the transcription program creates a grammar object. This plays the same role as the voice menu object, except a grammar object recognizes much more complex speech patterns. When you create a voice menu, you provide a list of phrases to listen for; when you create a grammar object, you provide a set of rules called a context-free grammar that specifies which words can grammatically follow one another. A typical rule might look something like this:

 <Start> = [please] send mail to (Mike | Fred | Bob)

You can probably decipher the notation yourself. "Please" is optional, while the parentheses and | (logical OR) symbols indicate that either Mike or Fred or Bob is expected. A user could say, "Please send mail to Mike," or "Please send mail to Bob," or "Send mail to Bob," and so on.
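
To see how much one rule covers, here is a small C++ sketch that expands a rule of this shape into every phrase it accepts. The Slot structure and the Expand and Accepts functions are made up for illustration; a real engine compiles grammars into something far more compact and doesn't enumerate phrases this way.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Slot {
    std::vector<std::string> choices;  // alternatives, e.g. (Mike | Fred | Bob)
    bool optional;                     // true for [bracketed] parts
};

// List every phrase the rule accepts, one slot at a time.
std::vector<std::string> Expand(const std::vector<Slot>& rule)
{
    std::vector<std::string> phrases(1, "");   // start with one empty phrase
    for (const Slot& slot : rule) {
        std::vector<std::string> extended;
        for (const std::string& prefix : phrases) {
            if (slot.optional)
                extended.push_back(prefix);    // the slot may be skipped
            for (const std::string& choice : slot.choices)
                extended.push_back(prefix.empty() ? choice
                                                  : prefix + " " + choice);
        }
        phrases.swap(extended);
    }
    return phrases;
}

// True if `phrase` is one of the expansions (exact, case-sensitive match).
bool Accepts(const std::vector<Slot>& rule, const std::string& phrase)
{
    const std::vector<std::string> all = Expand(rule);
    return std::find(all.begin(), all.end(), phrase) != all.end();
}
```

Encoding the rule as three slots (an optional "please", the fixed words "send mail to", and a choice of three names) gives six phrases in all: two ways to start times three names. Every optional word or alternative multiplies that count, so an explicit list of phrases stops being practical long before a grammar does.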

If the transcription program can't predict in advance what it's listening for, it would forgo the context-free grammar approach and opt for dictation. While context-free grammars are quite rich, they are not very efficient. It would take more memory than your computer has to store a context-free grammar for English. A dictation grammar is a different kind of grammar with special tricks for reducing the number of rules. A dictation grammar lets you express rules like "verb and noun must agree in number."

Whichever grammar you use, the grammar object notifies your app when something happens through yet another sink, the grammar notification sink. When the grammar recognizes a word or phrase, or has other grammar-specific information to report, it calls functions in the grammar notification sinks. Your app implements a sink that responds by taking whatever action it wants. The most important notification is PhraseRecognize. The grammar provides a text string of the spoken words. The transcription program would write them into a text file, perhaps along with timing information.

Typically, the engine knows a lot more than just what was spoken. It may have a list of alternative phrases (was it "Swing the cat" or "Swing the bat"?), timing information, or information about who is speaking. You can request a results object and interrogate it to find out more. This is how you'd get timing information.

The low-level speech objects are designed to support just about any feature a contemporary speech engine might offer. Because the API is so broad, not every engine supports every interface. This is especially true with results objects. For example, every engine can return the spoken words and timing information, but very few can identify the speaker. As a way of dealing with this, the formal API specification identifies a core set of mandatory features, and provides a mechanism to query which of the optional ones a given engine supports.

So much for speech recognition. The low-level text-to-speech objects are similar, but not as complex. An example of a program that might use low-level TTS functions is a mixer program that merges TTS with an audio file. You might write some poetry as text, then mix it with some MIDI music to create your own multimedia art. (If you do, please do not send it to MSJ.)

To implement the TTS mixer, you'd have to implement a custom audio destination object to receive the spoken words. Microsoft supplies an audio destination object for the default multimedia wave-out device (sound card), but you can implement your own. The mixer would need an audio destination that mixes the TTS signal with background music. Your custom audio destination object would accept digital audio from the TTS engine and, for every sample received, would read a sample of equal duration from the wave file, add the amplitudes, and send the combined audio to the multimedia wave-out device, or perhaps write it to another wave file.
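The "add the amplitudes" step is simple enough to sketch. This Python fragment mixes two sequences of 16-bit samples, one from the TTS engine and one from the music file. The function name is hypothetical, and two details are my own assumptions rather than anything the API prescribes: sums are clamped to the 16-bit range to avoid wraparound, and music that runs out early is treated as silence.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def mix_samples(speech, music):
    """Add two sequences of 16-bit samples, clamping sums to the 16-bit range.

    If the music is shorter than the speech, it is treated as silence
    beyond its end (an assumption made for this sketch).
    """
    mixed = []
    for i, speech_sample in enumerate(speech):
        music_sample = music[i] if i < len(music) else 0
        total = speech_sample + music_sample
        # Clamp instead of letting the sum overflow the 16-bit range.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed
```

For example, mix_samples([100, -200, 30000], [50, 50, 10000]) gives [150, -150, 32767]; the last sum is clamped. A production mixer would more likely scale the two signals before adding them so that clipping is rare.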

The same sort of handshaking goes on as with SR. You'd use a TTS engine enumerator object to find a TTS engine with the desired features, then hook the engine up to your custom audio destination. As part of the "hooking up," the TTS engine would have a dialogue with the audio destination object to find a common audio format, then set up an audio destination notification sink that your audio destination object would use to inform the engine when it starts or stops playing, or when your buffers are overflowing, and so on (see Figure 11). As with speech recognition, the handshaking happens invisibly to the app.

Figure 11 Low-level TTS Objects with Custom Audio Destination

The app registers a main notification sink that receives buffer-independent notifications, such as whether or not the engine is speaking, and what the lip positions are. Lip positions are typically used to synchronize speech with animation or other real-time events.

When the mixing program is ready to mix, it passes the engine one or more text strings, which are "spoken" to the audio destination. While the voice text object accepts only text, the low-level API lets you send phonetic descriptions or tagged text as well. You might use phonetic information to ensure that foreign names such as Grbac or Tchlzinski are pronounced correctly, while tagged text can contain bookmarks or other special embedded codes that tell the TTS engine which words to emphasize, when to change its voice, how quickly to speak, how long to pause between words, and so on. When the engine "speaks," it's really just sending digital audio to the audio destination object, which decides what to actually do with it. Instead of sending the audio to the sound card, your custom audio destination would mix it with background music.

If you want to know when particular words are being spoken, in order to synchronize the music to the words, you can insert bookmarks into your text and register a buffer notification sink for each text buffer you mix. When the TTS engine reaches a bookmark within the text, it calls functions in the buffer notification sinks. A bookmark is just a special tag embedded in the text (for example: "...\mrk=3453\...") that sends a notification to the app rather than being verbalized.
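To picture what happens to a bookmark, consider this Python sketch. It separates tagged text into the words to be spoken and a list of bookmark events, each with the position in the spoken text where it would fire. Only the \mrk=N\ tag shape comes from the example above; the function name and the idea of reporting character offsets are assumptions for illustration. A real engine reports bookmarks through the buffer notification sink as the audio plays, not as offsets.

```python
import re

# Matches a bookmark tag of the form \mrk=3453\ and captures the number.
BOOKMARK = re.compile(r"\\mrk=(\d+)\\")


def split_bookmarks(tagged_text):
    """Return (spoken_text, bookmarks); each bookmark is (id, offset in spoken_text)."""
    spoken_parts = []
    bookmarks = []
    position = 0      # index into tagged_text
    spoken_len = 0    # characters of spoken text collected so far
    for match in BOOKMARK.finditer(tagged_text):
        chunk = tagged_text[position:match.start()]
        spoken_parts.append(chunk)
        spoken_len += len(chunk)
        bookmarks.append((int(match.group(1)), spoken_len))
        position = match.end()
    spoken_parts.append(tagged_text[position:])  # text after the last tag
    return "".join(spoken_parts), bookmarks
```

Given r"Hello \mrk=7\world", the function returns ("Hello world", [(7, 6)]): the tag disappears from the spoken text, and bookmark 7 is recorded just before "world."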

Reality and Some Words of Advice

Speech recognition and text-to-speech let you create programs that listen and talk. They add a whole new dimension to a user interface. Unfortunately, the technology is still a long way from Star Trek. Full dictation still requires very fast hardware such as a dual Pentium or P6. And simple speech recognition isn't good enough for some purposes, like dialing individual digits over the phone. Even if an engine gets 99 percent accuracy per digit, after the user speaks ten digits in a row, there's only a 90 percent chance they're all correct, and that doesn't include your calling-card access number! It might be a great way to meet new friends, but most people won't accept the error rate.
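The 90 percent figure is just compound probability. Assuming each digit is recognized independently of the others (a simplification, since real recognition errors tend to cluster), the chance that all of them are right is the per-digit accuracy raised to the number of digits. A short Python sketch, with a function name of my own choosing:

```python
def all_correct_probability(per_item_accuracy, item_count):
    """Chance that every one of item_count independent recognitions is right."""
    return per_item_accuracy ** item_count


ten_digits = all_correct_probability(0.99, 10)
print(f"{ten_digits:.3f}")   # prints 0.904
```

Add more digits, such as a calling-card number, and the odds drop further with every one.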

Occasionally apps like Clock will get a CommandRecognize for the wrong command. If another program is listening for "What pay is it?" there's a chance the engine might mess up. The percentage of time a recognizer gets the correct answer is called "accuracy." Accuracy depends a lot on what's being listened for. Like humans, computers tend to confuse similar-sounding words. If all the commands are relatively dissimilar, you can get pretty good accuracy, up to 98 percent. If that's not good enough, you can always try changing your commands to something else, like "What's the time?" or "Computer, what time is it?"

Another common problem is that SR engines like to hear. If a user says, "What mime is it?" or "What mines it?" there's a good chance the recognizer will hear "What time is it?" Occasionally, a user will say something completely different like "Go away, you slime," but the engine will again recognize "What time is it?" The ability of a recognizer to reject what the user said is called, not surprisingly, rejection. Unfortunately, SR engines aren't as good as humans at rejection. They're not as picky. You should take this into account when you design your speech app.

And of course, don't forget that while sound is becoming more and more prevalent on PCs, not everyone has a sound card, speakers, and a microphone. Even those who do may not want their computers jabbering away at them, or want to listen to themselves jabbering.

So while speech can be extremely useful in many places, it's best to use it sparingly. For all but very specialized applications, speech should be optional. And even if speech recognition advances to the point of Star Trek, there will still be places where it's inappropriate. In my opinion, you wouldn't want to write an action game that made the player say "fire" to shoot his weapon, because it would always be faster to press the trigger.

In the future, Microsoft will extend the Speech API to add intelligence to dictation systems so they don't just transcribe word-for-word, but act more like a real person. For example, rather than returning "October first nineteen ninety five" when these words are spoken, they'll come back as "October 1, 1995." Microsoft will also enhance the Voice Commands module to accept more natural speech, so it can recognize "Tell me the time" or "Give me the time" as equivalent forms of "What time is it?" without any extra work from the application. These improvements take advantage of advances in speech technology from independent engine vendors.

Where To Get It

I've only touched on some of the capabilities of the Microsoft Speech API. Complete details can be found in the Microsoft Speech SDK, which at the time of this writing was in final beta and should be released to manufacturing by the time this article appears in print. For now, Microsoft Voice is being distributed along with the SDK, which should be available on the March '96 MSDN CD-ROM. This includes executables, which you may distribute royalty-free with applications or speech engines, as well as documentation, tools, and sample programs for you to use. Microsoft Voice will be distributed by OEMs with different machines and/or sound cards and may also be bundled with future Microsoft products. In the interim, if you want more information, or wish to obtain a copy of the SDK and Speech run times, send email to

Microsoft Voice

Paul DiLascia

When MSJ asked me to check out the latest in speech technology from Microsoft, I popped the Microsoft Voice floppy (actually, there are two) into my multimedia machine-a 486/66 with 28MB of RAM, a SoundBlaster 16 with cheapo speakers and a $10 microphone from Radio Shack-and typed SETUP.

After the usual installation wizard stuff, I got yet another icon added to my task bar (see Figure A). Clicking it gave me the menu in Figure B. I selected Properties and got a tabbed dialog that let me control various options, the most interesting of which is what voice I wanted my computer to have (see Figure C). There are several characters to choose from, with names like Deep Douglas, Eager Eddie and Grandpa Amos, all of whom sound like they're a few days shy of full recovery from a laryngectomy. Peter is the default and least grating among them-but just for fun, I selected Wanda, who sounds like a witch with her broom in the wrong place.

Figure A

Figure B

Figure C

When you first install Voice, you get a brief tutorial that asks you to say, "What can I say?" When you do, a list of voice commands pops up. You are then instructed to say "Close window." No matter how many times I did, that darned window just wouldn't go away! I kept getting the same sequence of ToolTip messages: "Heard. Not recognized. Please speak louder." When I yelled into the mike, I got the same sequence, sans "Please speak louder." I pressed Alt-F4 to close the window.

I fiddled around a bit: adjusted the input volume and gain, turned off my radio, held the mike close to my lips, and even "trained" Wanda to recognize my voice by repeating, at her request, the digits zero through nine plus nineteen short phrases including "Who am I?" which felt very existential. Eventually, I got it to work.

In fact, it worked pretty darn well! I was impressed. I said, "Start running Microsoft Word" in a normal voice and, sure thing, Voice launched Word! (When you install Voice, it scans your entire disk for programs and adds a "Start running X" command for every app it finds.) I said, "File New" and it created a new document. I said, "Switch to NDOS" and it switched to WinCIM. Well, that's OK, I can forgive Wanda for not knowing how to pronounce NDOS. I said "Next window" several times to cycle the windows until I got to my NDOS window. Just like pressing Alt-Tab. Wanda was able to consistently recognize other generic commands like "Close window," "Minimize window," "Press cancel," "Press enter," and "Show help."

Any time you run a program, Wanda automatically adds its menu to her repertoire, in effect turning any out-of-the-box Windows-based app into a speech app. I tried it on my TRACEWIN program from October's C/C++ column, and I was amazed that Wanda was able to recognize "Trace output off," "Trace output to window" and other TRACEWIN commands with no trouble. She's a pretty good listener, actually. Even if she can't talk too well. She had no problem recognizing my wife's voice, either, though I thought I detected a slight hint of jealousy in her responses, laryngectomy aside.

If you ever find yourself speechless, all you have to do is ask, "What can I say?" to get the window in Figure D, which lists everything you can say. I got global commands like "Show help" and "What can I say?" as well as TRACEWIN commands like "Trace output off."

Figure D

To check out text-to-speech, I opened my draft of this text, selected the first paragraph, and said, "Read selection." Wanda read it flawlessly in her raspy monotone, which by now seemed almost tolerable. She pronounced MSJ correctly as initials, 486/66 as "four-eighty-six slash sixty-six", lowered her voice when speaking parenthetically, and even converted $10 to "ten dollars." I did not fail to notice, however, that she pronounced the word "Microsoft" with suspicious clarity, leading me to suspect a few extra "if" statements in the code; whereas "SoundBlaster" came out like "SoudBlaster." But then it turned out I had in fact misspelled it exactly that way! Now I started to feel downright uneasy: Wanda was already finding my flaws.

When I turned on keyboard commands, which let you enter text by spelling, things started turning surreal. I said, "Pee ay you el," expecting to see my name, but it came out "88d." I figured that Microsoft needed to go back to the drawing board on that one. But no, it was my fault again; you have to use international alphabet mnemonics like Alpha, Bravo, Charlie, and so on to Zebra. Fortunately, I have my pilot's license, so I know that stuff by heart. I said, "Capital-papa alpha uniform lima," pausing several seconds between each word, and, sure enough, "Paul" typed itself magically into my doc! But when I spelled "DiLascia," the Find dialog popped up because Wanda thought I said "F3." Oh well, no one ever spells my name right anyway. Wanda got it the second time, but when I reached the "s" in "DiLascia", no matter how precisely I tried to enunciate "sierra," Wanda insisted on hearing it as "zero." At first, I took it as an insult, but then I realized she was just being her typical computer self, preferring digits to letters. So I forgave her. (I think I hurt her feelings, though, because after that she would every now and then for no apparent reason ask, via her ToolTip window, "Is your microphone plugged in?" There was nothing wrong with the microphone. I like to think she was just hinting that she wanted me to say something. As Mr. Rozak says in the article, speech engines like to hear.)

If you're wondering how well Wanda performs, well, I have to say she's in no danger of winning any speed dictation trophies. At best, she can handle about one command every five or ten seconds on my 486. Also, when Wanda listens, she gobbles CPU cycles the way Arnold Schwarzenegger gobbles roast beef sandwiches. Everything turns to molasses. To avoid processor gridlock, you can set things up so you have to press a key or move the mouse to the upper-left corner of your screen to make Wanda listen.

So, what's the bottom line? Well, I definitely wouldn't use Wanda to get any real work done unless I broke both my hands, and even then I'm not sure it wouldn't be faster to type with my elbows. But there's definitely some very real and impressive technology at work here. Text-to-speech is, not surprisingly, better than speech recognition. Maybe in another couple of years. But no matter how flawless the technology becomes, you won't ever catch me talking to my computer. It seems silly. TTS seems more useful. I can see having my computer read an article back to me, and I really like the way, even today, dictionary and encyclopedia programs can pronounce words and foreign place-names. And if they could just make Wanda sound a little more like Stevie Nicks, I might not mind her occasionally asking if my microphone is plugged in.

It sure makes for great demos, though. Just be careful whom you show it to. Now whenever I ask my wife when dinner'll be ready she says: "Heard. Not recognized."

From the January 1996 issue of Microsoft Systems Journal.
