Meeting Transcription Using Asynchronous Distant Microphones

Interspeech

Organized by ISCA

We describe a system that generates speaker-annotated meeting transcripts from multiple asynchronous distant microphones. The system comprises continuous audio stream alignment, blind beamforming, speech recognition, speaker diarization, and system combination. While improving meeting transcription accuracy by leveraging multiple recordings has been investigated in specific technology areas such as beamforming, our objective is to assess the feasibility of a complete system built on a set of mobile devices and to conduct a detailed analysis. With seven input audio streams, our system achieves a word error rate (WER) of 22.3% and a speaker-attributed WER (SAWER) of 26.7%, and comes within 3% of the close-talking microphone WER on non-overlapping speech. The relative gains in SAWER over a single-device system are 14.8%, 20.3%, and 22.4% for three, five, and seven microphones, respectively. The full system achieves a 13.6% diarization error rate, 10% of which is due to overlapped speech.
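For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how a word error rate and a relative gain are commonly computed. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and is not the scoring tool used by the authors, which the abstract does not specify. The function names `word_error_rate` and `relative_gain` and the example sentences are hypothetical, chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_gain(baseline_error: float, improved_error: float) -> float:
    """Fraction by which the error rate drops relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - improved_error) / baseline_error


if __name__ == "__main__":
    # Toy sentences, not data from the paper: one substitution, one insertion.
    wer = word_error_rate("the meeting starts at noon",
                          "the meeting start at noon today")
    print(f"WER = {wer:.1%}")

    # Toy error rates, not results from the paper.
    print(f"relative gain = {relative_gain(0.30, 0.24):.1%}")
```

A relative gain is a fraction of the baseline error rate, not an absolute difference in percentage points, which is how the multi-microphone SAWER gains above should be read.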