Microsoft Translator Blog

Microsoft Translator brings end-to-end speech translation to everyone with the world’s first Speech Translation API

Today, we released a new version of the Microsoft Translator API that adds real-time speech-to-speech (and speech-to-text) translation capabilities to the existing text translation API. Powered by Microsoft’s state-of-the-art artificial intelligence technologies, this capability has been available to millions of Skype users for over a year, and to iOS and Android users of the Microsoft Translator apps since late 2015. Now, businesses can add these speech translation capabilities to their applications or services and offer more natural and effective user experiences to their customers and staff.

Speech translation is available for eight languages — Arabic, Chinese Mandarin, English, French, German, Italian, Portuguese and Spanish. Translation to text is available in all of Microsoft Translator’s 50+ supported languages. Translation to spoken audio is available in 18 supported languages.

This new version of Microsoft Translator is the first end-to-end speech translation solution on the market optimized for real-life conversations (as opposed to simple human-to-machine commands). Before today, speech translation solutions had to be cobbled together from a number of separate APIs (speech recognition, translation, and speech synthesis) that were neither optimized for conversational speech nor designed to work with each other. Now, end users and businesses alike can remove language barriers by integrating speech translation into their familiar apps and services.


How can my business use speech translation technology?

Speech translation can be used in a variety of person-to-person, group, or human-to-machine scenarios. Person-to-person scenarios may include one-way translation such as personal translation or subtitling, as well as remote or in-person multilingual communication similar to what is currently found in Skype Translator or the Microsoft Translator apps for iOS and Android. Group scenarios could include real-time presentations such as event keynotes, webcasts, and university classes, or gatherings such as in-person meetings or online gaming chatrooms. Human-to-machine scenarios could include business intelligence scenarios (such as the analysis of customer call logs) or AI interactions.

We are just starting to scratch the surface of the scenarios where this technology will help. Because it is based on machine learning, its quality, and therefore its applicability, will improve over time as more people and companies use it.

Several partner companies have tested the API and integrated it into their own apps:

  • Tele 2 of Sweden, a leading mobile operator with more than 15 million subscribers in over 15 countries, integrated Translator into their PBX to support real-time translation of phone calls (no app needed!) on their cellular network.
  • LionBridge (Boston, MA), a language service provider and Gold Level Translator partner, developed an integrated video subtitling solution.
  • ProDeaf, an application vendor specializing in developing technologies to support the hard-of-hearing and deaf communities, integrated the new API into their sign language avatar app to enable multi-lingual support of speech to sign scenarios.


How does speech translation work?

Speech-to-speech translation is a very complex challenge. It uses the latest AI technologies, such as deep neural networks for speech recognition and text translation. There is no other fully integrated speech translation solution available on the market today, and delivering a platform that supports real-life speech translation scenarios required going beyond simply stitching together existing speech recognition and text translation technologies. Speech translation proceeds in four stages:

  1. Automatic Speech Recognition (ASR): A deep neural network trained on thousands of hours of audio analyzes incoming speech. This model is trained on human-to-human interactions rather than human-to-machine commands, producing speech recognition that is optimized for normal conversations.
  2. TrueText: A Microsoft Research innovation, TrueText takes the literal text and transforms it to more closely reflect user intent. It achieves this by removing speech disfluencies, such as “um”s and “ah”s, as well as stutters and repetitions. The text is also made more readable and translatable by adding sentence breaks, proper punctuation, and capitalization. (See picture below.)
  3. Translation: The text is translated into any of the 50+ languages supported by Microsoft Translator. The eight speech languages have been further optimized for conversations by training language models powered by deep neural networks on millions of words of conversational data.
  4. Text to Speech: If the target language is one of the eighteen speech languages supported, the text is converted into speech output using speech synthesis. This stage is omitted in speech-to-text translation scenarios such as video subtitling.
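
To make the flow of the four stages concrete, here is a minimal Python sketch. It is an illustration only, not the Translator API or Microsoft's implementation: the names `clean_transcript`, `translate_speech`, and `SpeechTranslationResult` are hypothetical, the recognition, translation, and synthesis stages are passed in as placeholder callables, and the cleanup step is a simple rule-based stand-in for TrueText that only drops a few filler words and immediate repetitions.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Set

# Filler words treated as disfluencies in this toy example.
# This list is an assumption, not the one TrueText uses.
FILLERS = {"um", "uh", "ah", "er"}


def clean_transcript(raw: str) -> str:
    """Toy stand-in for stage 2: drop fillers and immediately repeated
    words, then capitalize the first letter and add a final period."""
    words = re.findall(r"[\w']+", raw)
    kept = []
    for word in words:
        if word.lower() in FILLERS:
            continue
        if kept and kept[-1].lower() == word.lower():
            continue  # collapse "I I want" into "I want"
        kept.append(word)
    if not kept:
        return ""
    sentence = " ".join(kept)
    return sentence[0].upper() + sentence[1:] + "."


@dataclass
class SpeechTranslationResult:
    recognized: str          # stage 1 output (literal transcript)
    cleaned: str             # stage 2 output (readable text)
    translated: str          # stage 3 output (target-language text)
    audio: Optional[bytes]   # stage 4 output, or None if skipped


def translate_speech(
    audio_in: bytes,
    target_lang: str,
    recognize: Callable[[bytes], str],
    translate: Callable[[str, str], str],
    synthesize: Callable[[str, str], bytes],
    tts_languages: Set[str],
) -> SpeechTranslationResult:
    """Run the four stages in order; stage 4 is skipped when the target
    language has no speech output (for example, subtitling scenarios)."""
    recognized = recognize(audio_in)
    cleaned = clean_transcript(recognized)
    translated = translate(cleaned, target_lang)
    audio = None
    if target_lang in tts_languages:
        audio = synthesize(translated, target_lang)
    return SpeechTranslationResult(recognized, cleaned, translated, audio)


if __name__ == "__main__":
    # Placeholder stages so the sketch runs without any external service.
    result = translate_speech(
        b"",
        "fr",
        recognize=lambda audio: "um I I want to to go home",
        translate=lambda text, lang: f"[{lang}] {text}",
        synthesize=lambda text, lang: text.encode("utf-8"),
        tts_languages={"fr"},
    )
    print(result.cleaned)
    print(result.translated)
```

Passing the stages in as callables keeps the sketch independent of any particular service. In the integrated API described above, the caller does not chain these steps by hand.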

How do I get started?

It’s easy to get started with the new Microsoft Translator Speech API. A free 10-hour trial is available. You can test setup and implementation in a virtual environment and read the API documentation on our new Swagger page. You can also find example apps and other useful information on GitHub.

Of course, if you have questions, issues, or feedback, we’d love to hear from you! You can let us know on our feedback and support forum.
