Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems
- Vikas Joshi
- Amit Das
- Eric Sun
- Rupesh Mehta
- Jinyu Li
- Yifan Gong
Interspeech 2021
Multilingual end-to-end (E2E) automatic speech recognition (ASR) systems have manifold advantages: they simplify the training strategy, are easier to scale, and exhibit better performance than monolingual models. However, it is still challenging to use a single multilingual model to recognize multiple languages without knowing the input language, as most multilingual models assume that the input language is known. In this paper, we introduce a multi-softmax model that improves multilingual recurrent neural network transducer (RNN-T) models by using language-specific softmax, joint, and embedding layers while sharing the rest of the parameters. We extend the multi-softmax model to work without knowing the input language by integrating a language identification (LID) model that estimates the language on the fly while performing recognition at the same time. The multi-softmax model outperforms monolingual models with an average relative word error rate (WERR) reduction of 4.65% on Indian languages. Fine-tuning further improves the WERR reduction to 12.2%. The multi-softmax model with on-the-fly LID estimation shows a WERR reduction of 13.86% compared to the multilingual baseline.
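The parameter split described in the abstract (language-specific embedding, joint, and softmax layers on top of shared parameters) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the layer sizes, the language codes `hi` and `ta`, the single-matrix stand-ins for the encoder and prediction network, the concatenation-based joint, and the `pick_language` helper are all assumptions made for illustration. In particular, the abstract does not say how the LID estimate is combined with recognition, so picking the highest-scoring language is only one plausible reading.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT, HIDDEN = 6, 8              # hypothetical feature and hidden sizes
VOCAB = {"hi": 5, "ta": 7}       # hypothetical per-language vocabulary sizes

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared parameters: toy single-matrix stand-ins for the encoder
# and the prediction network.
W_enc = rng.normal(size=(FEAT, HIDDEN))
W_pred = rng.normal(size=(HIDDEN, HIDDEN))

# Language-specific parameters: one embedding, joint, and output
# (softmax) layer per language.
heads = {
    lang: {
        "embed": rng.normal(size=(v, HIDDEN)),
        "joint": rng.normal(size=(2 * HIDDEN, HIDDEN)),
        "out": rng.normal(size=(HIDDEN, v)),
    }
    for lang, v in VOCAB.items()
}

def step(frame, prev_token, lang):
    """One transducer-style step using the head for `lang`."""
    head = heads[lang]
    enc = np.tanh(frame @ W_enc)                         # shared
    pred = np.tanh(head["embed"][prev_token] @ W_pred)   # specific embedding, shared network
    joint = np.tanh(np.concatenate([enc, pred]) @ head["joint"])  # specific joint
    return softmax(joint @ head["out"])                  # specific softmax

def pick_language(lid_posteriors):
    """Assumed selection rule: take the most probable language."""
    return max(lid_posteriors, key=lid_posteriors.get)

frame = rng.normal(size=FEAT)
lang = pick_language({"hi": 0.2, "ta": 0.8})
probs = step(frame, prev_token=0, lang=lang)
```

Because each language has its own output layer, `probs` has as many entries as that language's vocabulary, while `W_enc` and `W_pred` are reused unchanged for every language.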