Microsoft’s contribution to this field is “Whistler” (Windows Highly Intelligent STochastic taLkER), a trainable text-to-speech engine that was released in 1998 as part of the SAPI 4.0 SDK and later shipped in Microsoft Phone, Microsoft Encarta, and the Windows 2000 and Windows XP operating systems. You type words on your keyboard, and the computer reads them back to you almost immediately. While it still has that distinct machine sound, it’s a big improvement on the flat, robotic voices of the past, particularly when large voice inventories are used.
Many of the improvements in speech synthesis over the past years have come from creative use of the technologies developed for speech recognition. The Whisper speech recognition engine isolates the sounds, called phonemes, that make up human speech. Counting each sound that each vowel can make, plus the consonants alone and in combination, plus the all-important placeholder, silence, there are about 40 phonemes in English. But each phoneme sounds different depending on what comes before and after it: the “o” in “hold” is longer than the “o” in “hot.” So it turns out that English speech consists of roughly 64,000 different phoneme variations, called phonemes in context or allophones. The Whistler and Whisper engines use a simplified database of about 3,000 allophones, which were isolated by cutting digital waveform recordings of the human voice into sections. The sections were organized into databases for use by the speech recognition engine.
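The arithmetic behind that figure can be sketched in a few lines. The sketch below assumes the 64,000 count comes from pairing each of the 40 phonemes with one left and one right neighbor (40 × 40 × 40); the text does not say how the number was derived. The phoneme labels and the `triphones` helper are hypothetical placeholders, not Whistler's actual inventory or code.

```python
# Illustrative arithmetic only; labels and helper names are hypothetical.
NUM_PHONEMES = 40  # approximate English phoneme count given in the text

# Assumption: a phoneme in context is (left neighbor, phoneme, right neighbor).
total_variations = NUM_PHONEMES * NUM_PHONEMES * NUM_PHONEMES


def triphones(phonemes):
    """Return (left, center, right) triples, padding both ends with silence."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


# The same vowel label lands in two different contexts,
# so it would be stored as two different allophones.
cat = triphones(["k", "ae", "t"])
cab = triphones(["k", "ae", "b"])

print(total_variations)
print(cat[1], cab[1])
```

Under this assumption, a database of about 3,000 allophones covers only a small fraction of the possible contexts, which is why the text calls it simplified.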
Senior Researcher Alex Acero saw that “the tools were just lying there” to build a speech synthesizer. The researchers, who included Scott Meredith, Mike Plumpe and Xuedong Huang, combined those phoneme databases with a text analyzer to make Whistler, which assembles the recorded sounds back into words and phrases.
Just as a shattered plate that’s been glued back together doesn’t look quite right, a word or phrase that’s been assembled from phonemes often sounds a little off pitch. The bigger the segment of sound, the more natural it sounds in the reconstruction, but using syllables or whole words as the building blocks would require a vast database. Because of product limitations, the versions of Whistler shipped in Windows could only include a voice inventory slightly above 1MB. Our laboratory versions, which use larger voice inventories, sound more natural.
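To make the "gluing" concrete, here is a minimal sketch of joining recorded units end to end. The linear crossfade at each seam is an assumption chosen for illustration, not a description of how Whistler smooths its joins, and `crossfade_concat` is a hypothetical helper.

```python
def crossfade_concat(units, overlap):
    """Join lists of samples, linearly crossfading `overlap` samples per seam.

    Assumption: a plain linear crossfade; a real engine may smooth joins differently.
    """
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit rises across the seam
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


# Two toy "recordings" with mismatched levels; the seam is blended, not abrupt.
joined = crossfade_concat([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]], overlap=2)
print(joined)
```

Every seam is a place where the result can sound off, so longer units mean fewer seams per word. The cost is the one the text names: longer units multiply the number of recordings the inventory has to hold.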
The inflection in the speaker’s voice is often the key to understanding the meaning of a spoken phrase. We learn inflections as children by imitating the speech patterns of our elders until they are ingrained as an accent. The nuance that a native speaker picks up from the tone of another’s voice is difficult to impart to a non-native speaker, let alone a computer. Researchers had to add prosody, the pitch and duration of sounds that give them additional meaning, to make Whistler’s voice sound more natural and pleasant. Singing speech synthesizers sound better because the prosody is already specified by the song.
Singing Speech Synthesizers
Most people think of speech synthesis as having your computer speak to you. Proofing data entry, reading files and speaking prompts have been typical applications for speech synthesizers. Although synthesis technology is well suited for these traditional operations, here at Microsoft Research we are continually exploring new and exciting applications of our base technologies.
In addition to speaking, another popular use of the human voice is singing. During the past 50 years, music synthesizers have developed to the point where they can imitate almost any acoustic instrument. Any acoustic instrument, that is, except for the most popular instrument – human vocals. “And there’s a good reason for this: singing is the most complex and dynamic of all musical instruments,” says Mark Cecys, the researcher who worked on this project. With the recent advances in computing and speech technology, we are finally moving beyond this limitation. Besides playing the instrumental parts, music synthesizers can now begin to sing the lyrics.
The Whistler Music Synthesizer
To demonstrate the potential of Microsoft’s Whistler speech technology for musical applications, a novel music synthesizer was designed. It combines the Whistler speech engine with a software wavetable synthesizer and runs in real time on Win32. The wavetable synthesizer plays the instrumental accompaniment while Whistler sings the lyrics. The notes and lyrics are entered using a commercial MIDI editor, then exported as a Standard MIDI File to the synthesizer for fine tuning and musical playback.
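A Standard MIDI File already carries a pitch and a length for every note, so each sung syllable can be given a target frequency and duration directly. The sketch below shows that conversion using the standard equal-temperament formula (MIDI note 69 = A4 = 440 Hz) and the MIDI default tempo of 500,000 microseconds per quarter note. The `SungNote` type, the 480 ticks-per-quarter resolution, and the syllable-per-note pairing are assumptions for illustration, not the synthesizer's actual data model.

```python
from dataclasses import dataclass


@dataclass
class SungNote:
    syllable: str   # lyric text attached to the note (hypothetical pairing)
    midi_note: int  # MIDI note number; 69 is A4
    ticks: int      # note length in MIDI ticks


def note_to_hz(midi_note, a4_hz=440.0):
    """Equal-temperament pitch for a MIDI note number."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)


def ticks_to_seconds(ticks, ticks_per_quarter=480, tempo_us_per_quarter=500_000):
    """Note length in seconds; 480 ticks per quarter note is an assumed resolution."""
    return ticks / ticks_per_quarter * tempo_us_per_quarter / 1_000_000


def prosody_targets(notes):
    """Return (syllable, pitch in Hz, duration in seconds) for each sung note."""
    return [
        (n.syllable, note_to_hz(n.midi_note), ticks_to_seconds(n.ticks))
        for n in notes
    ]


melody = [SungNote("la", 69, 480), SungNote("la", 81, 960)]
for syllable, hz, seconds in prosody_targets(melody):
    print(syllable, round(hz, 1), seconds)
```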
The following examples use Whistler’s stock “Mark” and “Melanie” voices for all the vocals. A key feature of Whistler technology is modeling the particular characteristics of real human speakers. In other words, after analyzing a specific speaker’s voice, Whistler can faithfully reproduce the voice characteristics, sounding very close to the original speaker (or singer!).
Although the synthesizer output is 16-bit, 44.1 kHz stereo, all examples have been scaled back to 8-bit, 22 kHz mono to reduce download time.
- Mark and Melanie duet of Unexpected Song (WAV, 951K)
- Mark singing Penny Lane (WAV, 1.1M)
- Melanie’s first song (WAV, 641K)
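The saving from scaling the audio back can be worked out from the two formats quoted above. This is a sketch for uncompressed PCM audio only; it assumes "22 kHz" means 22,050 Hz (half of 44,100 Hz), and `bytes_per_second` is a helper introduced here for illustration.

```python
def bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Data rate of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample // 8 * channels


full = bytes_per_second(44_100, 16, 2)    # synthesizer output: 16-bit stereo
reduced = bytes_per_second(22_050, 8, 1)  # assumes "22 kHz" means 22,050 Hz
ratio = full / reduced

print(full, reduced, ratio)
```

Halving the sample rate, the sample size, and the channel count each cuts the data in half, so under these assumptions the download is one eighth the size of the original output.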
The difference between a person and a talking computer is that the person understands the ideas and emotions conveyed through speech, and the computer doesn’t. This is part of the larger problem of artificial intelligence, which is what “2001” author Arthur C. Clarke imagined in HAL. Our ability to replicate our own minds in a machine is limited by our incomplete knowledge of how our own minds work. The ultimate goal for speech synthesis, as with all AI applications, is to make it pass the Turing Test – a blindfolded user shouldn’t be able to tell whether he is talking to a human or a machine. Like the voice of HAL, that’s a long way away. But Acero believes he knows how to get there: “I’m interested in using what I’ve learned in speech synthesis to modify speech recognition,” he says. “Ultimately the right model might just be the same for both synthesis and recognition.” After all, he notes, our brains perform these functions simultaneously.