Duolingo, the popular language-learning app, has been using artificial intelligence (AI) to enhance the learner experience and bring free education to everyone. As a startup, Duolingo leveraged AI to pursue its mission of making language learning fun. Now, as a multimillion-dollar business, Duolingo is sharing the technology and engineering decisions that proved instrumental in making it an iconic brand.
One key area where Duolingo has been using AI is speech technology. We sat down with Fabio Lessa, Senior Director of Engineering, and Kevin Lenzo, Speech Lab Lead at Duolingo, for their insights into this technology. Fabio leads the team responsible for the services, such as cloud infrastructure and data management, needed to operate Duolingo successfully. Kevin leads the team that keeps speech technology working in all languages and diverse scenarios, with a specific focus on speech recognition and synthesis for speaking and listening exercises. Together, their teams work to make Duolingo more engaging and effective for learners everywhere in the world using AI and machine learning.
On Duolingo and its approach to speech AI as a core strategy
Fabio Lessa: “Speech is an important part of learning a language. One thing that we’ve always focused on has been to make language learning fun, so that everyone wants to practice every day. This is where our characters, and their personalities, come in. We work hard to make sure that the voice is matched to the character personality to give that extra level of sophistication and polish to the app and make the experience more delightful. These characters have helped Duolingo become the iconic brand it is today, and getting the voice of these characters right is crucial.”
Kevin Lenzo: “That’s where the technical challenge began: how to make text-to-speech voices that fit these characters. This is complex, because teaching a language requires a high degree of precision in text-to-speech. We’re a pretty small team. We have a strong background in natural language processing, but we have relatively few speech technology-oriented scientists.
“We were looking for solutions and knew that Microsoft had some of the best technology and experience with text-to-speech, so we decided to partner with them. By working with Microsoft, we were able to use their custom neural voice services to create unique text-to-speech voices for each of our characters. This lets us give our characters a distinct personality and make every lesson more engaging for learners.”
Partnering with Microsoft to build an MVP and scale
Kevin Lenzo: “The first step was evaluating the technological landscape and determining which provider would best fit our needs for voice building. We ultimately settled on Microsoft’s Cognitive Services. From there, we had to define what we needed in these voices. We were designing for language learning, so we knew that we needed phonetic coverage and prosodic coverage, among other things. We had to design for entire courses, which included isolated words, questions, exclamations, et cetera. We recorded 6,000 sentences from a representative set of materials, which was above the baseline but necessary.
“We stood up the infrastructure to serve up text-to-speech audio for the course content and implemented a new level of content management to ensure consistency between the character personalities and their speech. Quality assurance was also an important aspect of the process, including detecting problems and finding fixes by using markup or removing bad data. On top of that, we monitored everything to detect problems as quickly as possible.
“One of the features that brought our costs down was cross-lingual technology. Once we had our high-quality flagship voices in five languages, we could almost double the number of languages we offered by reusing the base voice characteristics of the originals. This really helped us scale quickly!”
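To make the idea of fixing problems "by using markup" more concrete, here is a minimal sketch, assuming the markup in question is SSML, the W3C Speech Synthesis Markup Language that Azure text-to-speech accepts. It is not Duolingo's actual pipeline. The helper `build_ssml`, the voice name, and the pronunciation table are hypothetical and exist only for illustration. The sketch wraps selected words in SSML `phoneme` elements so that a synthesizer which honors them uses the supplied IPA pronunciation.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice, lang="en-US", phoneme_fixes=None):
    """Return an SSML document for `text`.

    `phoneme_fixes` maps a lowercase word to an IPA string. Matching words
    are wrapped in <phoneme> tags; everything else is escaped and passed
    through unchanged. (Hypothetical helper, for illustration only.)
    """
    phoneme_fixes = phoneme_fixes or {}
    parts = []
    for word in text.split(" "):
        # Keep trailing punctuation outside the <phoneme> element.
        core = word.rstrip(".,!?;:")
        tail = word[len(core):]
        ipa = phoneme_fixes.get(core.lower())
        if ipa:
            parts.append(
                f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>'
                f"{escape(core)}</phoneme>{escape(tail)}"
            )
        else:
            parts.append(escape(word))
    body = " ".join(parts)
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{body}</voice></speak>"
    )


if __name__ == "__main__":
    # "HypotheticalCharacterVoice" is a placeholder, not a real voice name.
    print(build_ssml(
        "I read the book.",
        voice="HypotheticalCharacterVoice",
        phoneme_fixes={"read": "rɛd"},  # force the past-tense pronunciation
    ))
```

Keeping corrections in a small lookup table, separate from the course text, is one possible design choice: the same fix can be reapplied whenever the sentence is re-synthesized, and an entry can be dropped once the underlying voice no longer needs it.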
Understanding industry trends and when to build, borrow, or buy technology
Kevin Lenzo: “If you look at the rate of adoption for any technology, at the low end of adoption there are a bunch of academics emailing their papers to each other. Over time, more people pick it up and value-add starts to happen. At this point, there’s a sweet spot for a company that wants to follow up on that technology. Duolingo was at the right place at the right time for language learning technology. Speech technology is a key component of our language learning app, and Microsoft has a strong track record and some of the best technology with neural voices. So, instead of building new character voices from scratch ourselves, we partnered with Microsoft. We could focus on the characters themselves, the content, and our core mission of making language learning fun and accessible to everyone, with high-quality custom voices in our app.”
Duolingo’s use of AI to build its business and brand is a great example of how startups can leverage AI to enhance the user experience. With recent advances in AI, like VALL-E from Microsoft Research or Custom Neural Voice in Azure, startups are better positioned than ever to meet a niche business need in the market. How will you launch with AI when building your next venture?
For more tips on leveraging AI for your startup and for access to Azure’s AI services, sign up today for Microsoft for Startups Founders Hub.