Spatial Audio Rendering for Live Speech Translation

  • Margarita Geleta, UC Berkeley

Language barriers in virtual meetings remain a persistent challenge to global collaboration. While real-time translation technologies offer a promising solution, their integration into conversational interfaces often neglects key perceptual cues. This study explores how spatial audio rendering of translated speech affects comprehension, cognitive load, and user experience in multilingual teleconferencing. We conducted a within-subjects experiment simulating global team meetings with 8 confederates (speakers) and 47 participants (listeners), using Wizard-of-Oz live English translations of conversations in Greek, Kannada, Mandarin Chinese, and Ukrainian, four languages selected for their diversity in grammar, script, and resource availability. Participants experienced four audio conditions for the translated speech: spatial audio (aligned with the speaker's on-screen location) with and without background reverberation, and two non-spatial configurations (diotic and monaural). We measured listener comprehension accuracy, NASA-TLX workload ratings, and satisfaction Likert scores, complemented by qualitative feedback.
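
For concreteness, the sketch below illustrates how the four audio conditions could be produced from a mono translated-speech stream. It is a minimal Python/NumPy approximation using a simple ITD/ILD panning model and a toy reverberation tail, not the authors' rendering pipeline; the sample rate, head radius, pan law, and reverb parameters are all assumptions made for illustration.

```python
"""Illustrative sketch (not the study's actual renderer): produces stereo
signals for the four conditions described in the abstract. All parameter
values below are assumptions."""
import numpy as np

FS = 48_000           # sample rate in Hz (assumed)
HEAD_RADIUS = 0.0875  # average head radius in metres (Woodworth model)
SPEED_OF_SOUND = 343  # m/s

def spatialize(mono: np.ndarray, azimuth_deg: float, reverb: bool = False) -> np.ndarray:
    """Pan a mono signal toward `azimuth_deg` (negative = left, positive = right),
    approximating 'spatial audio aligned with the speaker's on-screen location'."""
    az = np.radians(azimuth_deg)
    # Interaural time difference (Woodworth approximation) -> integer sample delay
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (az + np.sin(az))
    delay = int(round(abs(itd) * FS))
    # Interaural level difference via a constant-power pan law
    pan = (np.sin(az) + 1) / 2                      # 0 = hard left, 1 = hard right
    gain_l, gain_r = np.cos(pan * np.pi / 2), np.sin(pan * np.pi / 2)
    left, right = gain_l * mono, gain_r * mono
    if itd > 0:    # source on the right: the left ear hears it slightly later
        left = np.concatenate([np.zeros(delay), left])[: len(mono)]
    elif itd < 0:  # source on the left: delay the right ear instead
        right = np.concatenate([np.zeros(delay), right])[: len(mono)]
    stereo = np.stack([left, right], axis=1)
    if reverb:     # toy room response: exponentially decaying noise tail
        rng = np.random.default_rng(0)
        n_ir = int(0.3 * FS)
        ir = rng.standard_normal(n_ir) * np.exp(-6 * np.linspace(0, 1, n_ir))
        stereo = np.stack(
            [np.convolve(stereo[:, c], ir)[: len(mono)] for c in range(2)], axis=1
        )
    return stereo

def diotic(mono: np.ndarray) -> np.ndarray:
    """Identical signal in both ears (first non-spatial control)."""
    return np.stack([mono, mono], axis=1)

def monaural(mono: np.ndarray, ear: str = "left") -> np.ndarray:
    """Signal in one ear only (second non-spatial control)."""
    silent = np.zeros_like(mono)
    return np.stack([mono, silent] if ear == "left" else [silent, mono], axis=1)
```

In such a setup, the azimuth passed to `spatialize` would be derived from the speaker's tile position in the meeting grid, so the translated voice appears to come from the same direction as the video feed; the diotic and monaural functions correspond to the two non-spatial baselines.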

Results show that participants listening to spatially rendered translated speech were more than twice as likely to comprehend the translated content as those listening to non-spatial renderings, and reported approximately 2.4% lower perceived listening effort. Participants also reported greater clarity and engagement when spatial cues and voice timbre differentiation were preserved. We discuss design implications for integrating real-time translation into virtual meeting platforms, offering guidelines for delivering translated speech in ways that minimize cognitive load and improve conversational clarity. These findings advance best practices for inclusive, cross-language communication in telepresence systems.