FOA Tokenizer: Learning Discrete Representations of Spatial Audio with Multichannel VQ-GAN
- Parthasaarathy Sudarsanam, Tampere University; Hannes Gamper, Microsoft
Spatial audio captures the directional and environmental characteristics of sound, enabling immersive listening experiences. First-Order Ambisonics (FOA) provides a compact representation of spatial audio by encoding the sound field's directional components across four channels, allowing full-scene coverage independent of microphone array geometry. A key advantage of FOA is its rendering flexibility: it can be decoded to any loudspeaker configuration, including stereo, surround, binaural, and custom arrays, making it well suited to diverse playback environments. Modeling FOA signals is therefore essential for immersive audio applications, yet remains challenging due to their high dimensionality and spatial complexity. Building upon the WavTokenizer framework, we introduce FOA Tokenizer, a multichannel VQ-GAN that learns discrete latent representations of FOA audio to support both discriminative and generative downstream tasks. The model achieves high compression, encoding 4-channel FOA audio at 24 kHz using only 75 tokens per second. To preserve spatial fidelity, we propose a spatial consistency loss that enforces directional coherence in the reconstructed audio. Our approach reconstructs spatial cues with high accuracy, achieving an absolute angular error of 14° on noisy, reverberant data and 4° on clean, non-reverberant speech. This framework enables compact and spatially consistent representations of FOA audio, facilitating applications in sound source localization, synthesis, and immersive scene understanding.
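The four FOA channels encode the sound field up to first order, and the stated token rate implies a fixed compression of the raw waveform. The sketch below illustrates both: it pans a mono source into four FOA channels and works through the token-rate arithmetic. It assumes the common ACN/SN3D (AmbiX) channel convention, since the abstract does not specify one, and encode_foa is a hypothetical helper, not code from the paper.

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    """Pan a mono signal to 4-channel FOA (assumed ACN order W, Y, Z, X; SN3D norm)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    w = mono                                 # omnidirectional pressure component
    y = mono * np.sin(az) * np.cos(el)       # left-right directional component
    z = mono * np.sin(el)                    # up-down directional component
    x = mono * np.cos(az) * np.cos(el)       # front-back directional component
    return np.stack([w, y, z, x])            # shape: (4, num_samples)

sr = 24_000                                  # sample rate reported in the abstract
tokens_per_sec = 75                          # reported token rate
samples_per_token = sr // tokens_per_sec     # 24000 / 75 = 320 samples per token
# A single 75 Hz token stream covers all four channels, so the sequence
# shrinks from 4 * 24000 = 96000 waveform values per second to 75 tokens.
```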
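The abstract introduces a spatial consistency loss and reports absolute angular errors, but does not give the loss's exact form. One plausible realization, sketched below purely for illustration, derives a direction-of-arrival (DOA) vector from the pseudo-intensity of the FOA channels and penalizes the cosine distance between reference and reconstruction; the same vector yields an angular-error metric. All function names here are hypothetical, and this should not be read as the authors' actual formulation.

```python
import numpy as np

def pseudo_intensity_doa(foa, eps=1e-8):
    """Time-averaged pseudo-intensity DOA unit vector (assumed ACN order W, Y, Z, X)."""
    w, y, z, x = foa
    # Pseudo-intensity: pressure (W) times the particle-velocity proxies (X, Y, Z).
    intensity = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    return intensity / (np.linalg.norm(intensity) + eps)

def spatial_consistency_loss(ref_foa, rec_foa):
    """Penalize directional mismatch via cosine distance between DOA vectors."""
    return 1.0 - float(np.dot(pseudo_intensity_doa(ref_foa),
                              pseudo_intensity_doa(rec_foa)))

def angular_error_deg(ref_foa, rec_foa):
    """Absolute angular error in degrees between reference and reconstruction."""
    cos = np.clip(np.dot(pseudo_intensity_doa(ref_foa),
                         pseudo_intensity_doa(rec_foa)), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))
```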
-
Parthasaarathy Sudarsanam
Doctoral Researcher
Tampere University
-
Hannes Gamper
Principal Researcher
Microsoft