Enabling Cross-Lingual Conversations in Real Time
Sometimes, a dream needs but a nudge to propel it toward reality.
Rudder, then Microsoft’s chief research and strategy officer, and Lee, head of Microsoft Research, had been among those who had seen demonstrations of a research project designed to provide fluent, cross-lingual conversations between speakers of different languages. The demos were tantalizing, offering the potential to smash through the language barriers that have hindered human communication for thousands of years.
Rudder and Lee were convinced: The time had come to make it real.
That task fell to, among others, the Machine Translation team, based in Redmond and headed by Arul Menezes.
The team’s progress to date was on display May 27, during the Code Conference in Rancho Palos Verdes, Calif., in a talk by Microsoft CEO Satya Nadella. During Nadella’s conversation about a new era of personal computing with Kara Swisher and Walt Mossberg of the tech website Re/code, he asked Gurdeep Pall to join him on stage. Pall, Microsoft corporate vice president for Skype, gave the first public demonstration of the Skype Translator app, conversing in English with German-speaking Microsoft employee Diana Heinrichs.
The interaction underscored the steady advancement the team has achieved.
“We felt speech translation was a very natural evolution of the text-translation work we’ve been doing,” says Chris Wendt, program manager of the Machine Translation team, who had long been pushing for such an incubation effort.
“It’s an exciting project,” he says, “and it became clear that adding this capability to Skype and enabling people to have translated conversations was the killer scenario to get this technology into customers’ hands.”
There have been many attempts over the years, several within Microsoft Research, to demonstrate translation of human speech. But delivering something usable in real life, robust to the voices and utterances of many different users and to the nuances of different languages, and built at scale to serve Skype users, has been considered a nearly impossible task.
Making Skype Translator available first on Windows 8 later this year as a limited beta has required a series of remarkable research advances in translation, speech recognition, and language processing, combined with contributions from Microsoft engineering and research teams near and far.
The Machine Translation team, already responsible for state-of-the-art translation technology and the cloud service used extensively in a number of Microsoft products, has taken a One Microsoft approach to the challenge by drawing on contributions from researchers and engineers working on Microsoft’s speech service. Additionally, to develop a new yet natural user experience, the team formed a deep partnership with Skype’s designers and engineers, particularly the prototyping team led by Jonas Lindblom.
“The interesting thing about this project,” Menezes says, “is we’ve got these two fairly complex technologies coming together for the first time to provide this end-to-end user experience.”
Microsoft Research has been focused on machine translation for more than 10 years. Initial results came with translations for Microsoft’s product-support Knowledge Base. The technology became available for public use as the engine behind Bing Translator, spurring the tool’s growth and popularity since its launch. Today, Bing Translator is used by millions and embraced by organizations worldwide.
Along the way, Menezes and Microsoft colleagues have addressed significant system and user-interface design challenges, including reducing latency, developing visual feedback, and ensuring the translation system continuously improves itself based on user feedback.
“The technology is only as good as the data,” Menezes says. “One big focus has been to scale up the amount and kinds of data that go into the machine-learning training of these systems.”
The data need was felt acutely at Microsoft Research’s lab in Beijing, where speech researcher Frank Seide has been gradually increasing the number of hours of conversational data he uses to fine-tune his model-based training approach. Work that began with just 24 hours of such data now draws on significantly more. Seide’s state-of-the-art conversational-speech models were critical in enabling the seamless communication demonstrated by Pall.
Conversational training data is hard to obtain, and the team had to develop new techniques to collect conversational speech data. One example, Menezes says, came from analysis of social-media posts.
“It gives us exposure to yet another genre,” he says. “How people write on social media is not how they speak, but still, there’s some crossover of slang and related utterances that can help this system and make it contemporary.”
There’s also the issue of “disfluency,” one of the key differences between how people write and how they speak. When talking, people use lots of pauses, restarts, and filler words that bridge the gaps between their thoughts.
“It’s not just repeating a single word,” Menezes says. “Sometimes, you’ll go three words into a sentence and then back up and restate it. In some languages, it’s more of a challenge than others, especially languages like Spanish, where words have to agree in grammatical gender.”
Untangling such conversations requires lots of training. So does determining where sentence breaks occur. The sentence is the basic unit in translation, and without punctuation, it can be difficult to identify. In fact, it can be next to impossible to read a transcript of a conversation lacking punctuation. The translator must learn to segment out the speech input, too.
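In the real system, this cleanup and segmentation is learned from training data by statistical models. Purely as a hypothetical illustration of the task Menezes describes, a toy rule-based version (invented filler list, repeated-word removal, pause-based sentence breaks) might look like this:

```python
# Toy sketch of disfluency removal and sentence segmentation on an
# unpunctuated transcript. This is NOT Microsoft's method: the filler
# list, the pause marker, and the rules are illustrative assumptions.

FILLERS = {"um", "uh", "er"}  # hypothetical single-word filler inventory

def remove_disfluencies(tokens):
    """Drop filler words and immediate word repetitions ("I I think")."""
    cleaned = []
    for tok in tokens:
        if tok.lower() in FILLERS:
            continue
        if cleaned and tok.lower() == cleaned[-1].lower():
            continue  # treat an exact repeat as a restart
        cleaned.append(tok)
    return cleaned

def segment(transcript, pause_marker="<pause>"):
    """Split an unpunctuated transcript into sentence-like units at pauses."""
    sentences = []
    for chunk in transcript.split(pause_marker):
        tokens = remove_disfluencies(chunk.split())
        if tokens:
            sentences.append(" ".join(tokens))
    return sentences

print(segment("um so I I think <pause> uh we should ship it"))
```

A trained system replaces each of these hand-written rules with a model, but the pipeline shape, cleaning the tokens and then finding sentence boundaries before translation, is the same.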
“That’s one of the things over the last year that my team’s been doing, resolving the mismatch between the way people talk and the way they write,” Menezes says. “If your translation system is focused on written text, it works very poorly with spoken language.”
In addition to machine translation, the speech-translation work has relied on years of research and engineering advances in several other areas, most notably speech recognition, an area in which speech researchers from Microsoft’s Beijing and Redmond labs have played pivotal roles.
The path to the Skype Translator gained momentum with an encounter in the autumn of 2010. Seide and colleague Kit Thambiratnam had developed a system they called The Translating! Telephone for live speech-to-text and speech-to-speech translation of phone calls.
Another major step came from work by Dong Yu and Li Deng with master’s-level student intern George Dahl, using senones, small pronunciation fragments, as the training targets for neural networks. That led to a major boost in performance, improving speech-recognition accuracy by 16 percent.
“I was very excited,” smiles Yu, a researcher based in Redmond, “mainly because I had been working on this area for a long, long time, and I knew that improvement at that scale had never happened in my career.”
Seide was impressed and intrigued by the new performance levels, and a close collaboration began. Further developments pushed the improvement to 24 percent, and continued tweaking eventually resulted in a 42 percent gain. They were onto something.
At that point, Seide had to re-engineer his infrastructure to handle large-scale tasks. He and his Beijing colleague Gang Li worked on developing the core speech-recognition technology, especially on building new models, each of which can take weeks.
A core development that enables Skype translation came from Redmond researcher Li Deng. He invited Geoff Hinton, a professor at the University of Toronto, to visit Redmond in 2009 to work on new neural-network learning methods, based on a couple of seminal papers from Hinton and his collaborators in 2006 that had brought new life to the use of neural networks in signal modeling and classification.
What had changed? For one thing, the proliferation of big data had made large-scale training of neural networks practical, and certain machine-learning methods, dormant for years, gained new currency.
“Deepness” also played a role, Deng says. In the ’90s, it was thought the key to the neural-network approach to speech recognition was a single large hidden layer that could approximate the mappings needed to improve speech recognition.
“People in those early days didn’t realize that using many layers was becoming very important,” Deng says. “But that’s what many parts of the structure in the brain look like.”
In January 2012, Microsoft researchers co-authored a paper called “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition,” published in IEEE Transactions on Audio, Speech, and Language Processing. This was the first time the term “deep neural networks” had been applied to such work.
Nowadays, that coinage has become common in speech-recognition circles. Four years ago, the IEEE International Conference on Acoustics, Speech, and Signal Processing featured three papers on deep neural networks. This year’s conference included almost a hundred.
That journal paper, co-written by Yu and Deng, among others, has been cited so often by others that it recently won the 2013 IEEE Signal Processing Society Best Paper Award, a test-of-time honor recognizing top papers of the past five years.
Using the senone as the primary unit in speech decoding paid another dividend. Major companies—including Microsoft—had invested heavily in building the ability to handle speech. Those infrastructures were built before the advent of deep neural networks. Previous models had been based on the Gaussian Mixture Model, which delivered a senone as the output.
The DNN approach also featured a senone as the output, thus enabling efficient reuse of existing computing infrastructure.
These breakthroughs already have paid dividends. Microsoft’s Bing Platform Group is using them to deliver best-in-class speech recognition for Cortana and other speech-powered experiences. With recognition accuracies approaching human capabilities, this close partnership between the Skype, Microsoft Research, and Bing Platform groups is critical in delivering a translation service to Skype users worldwide.
A critical test of this One Microsoft approach to developing Skype Translator occurred on Oct. 25, 2012, in Tianjin, China, during a Microsoft Research computing conference. In a keynote address, Rick Rashid, then the worldwide head of Microsoft Research, publicly debuted the speech-to-speech translation project.
“I made sure to include pauses after each sentence,” Rashid recalls, “so that the audience would have time to clearly hear the Mandarin version of what I was saying. This also meant there was plenty of time for the audience to react. I remember hearing some gasps from the front rows, along with general applause and approval from the audience. It was quite moving.”
That marked a key moment in the evolution of both speech recognition and machine translation.
“I think the demo was a clear harbinger of the arrival both of deep-neural-network speech recognition and real-time language translation,” Rashid says. “It set a bar for what could be done and provided a glimpse into a future where language no longer need be a barrier.
“I’ve been a fan of science fiction from my pre-teen years. Seeing some of the technology that was once purely fiction become reality was exhilarating!”
Now, Wendt is one of those bringing this technology to the masses. He helps spearhead the One Microsoft approach to Skype Translator’s development, including working closely with Skype colleagues Lindblom, Daniel Nitsche, and Fredrik Furesjö. With help from Skype’s Steve Pearce and Redmond researcher Shamsi Iqbal, the team was able to fine-tune the interaction model for the experience.
“What I found interesting as the team overcame the many design, research, and engineering challenges to get this far is how the dynamic between us and the technology changed with the introduction of a more natural way to communicate,” says Vikram Dendi, who had been part of the Machine Translation team before becoming Lee’s technical and strategy adviser. “With traditional text translation, as a user, it was easy for me to blame the technology when it didn’t get something right and just give up.
“What’s been fascinating is the willingness for the two parties trying to communicate with each other to work with the technology to help each other understand in the case of speech translation. Even in beta, this makes this technology very useful and usable.”