
Microsoft Translator Blog

Microsoft Translator publicly releases speech translation corpus

Christian Federmann, senior program manager

As part of an ongoing effort within Microsoft to improve the accuracy of artificial intelligence (AI) systems, Microsoft Translator is publicly releasing a set of data that includes multiple conversations between bilingual speakers who are speaking French, German and English.

This corpus, which was produced by Microsoft using bilingual speakers, aims to create a standard by which people can measure how well their conversational speech translation systems work. It can serve as a standardized data set for testing bilingual conversational speech translation systems such as the Microsoft Translator live feature and Skype Translator.

Christian Federmann, a senior program manager working with the Microsoft Translator team, said there aren’t many standardized data sets for testing bilingual conversational speech translation systems. “You need high-quality data in order to have high-quality testing,” Federmann said.

The Microsoft team hopes the corpus, which is freely available, will benefit the entire field of conversational translation and help to create more standardized benchmarks that researchers can use to measure their work against that of others.

“This helps propel the field forward,” said Will Lewis, a principal technical program manager with the Microsoft Translator team who also worked on the project.

Download the Microsoft Speech Language Translation corpus here.

Learn more about this release as well as other ways Microsoft is working to make AI smarter and more accurate in the Microsoft Research blog.

