Exploring cross-dataset generalization of Speech Emotion Recognition models

J. Acoust. Soc. Am., Vol. 157, pp. A76-A77

ISBN: 978-9-46-459362-4

Speech Emotion Recognition (SER) is a technology that enables machines to identify, interpret, and respond to the emotional nuances of human speech. Its role in enhancing human-computer interaction becomes increasingly apparent as we prioritize the development of more intuitive and empathetic AI systems. Despite the high performance achieved by prior work on individual datasets, a critical challenge remains: cross-dataset generalization. This aspect continues to pose significant challenges and hinders technology adoption. In this study, we investigate the cross-dataset generalization capabilities of various SER models. We build on our prior observation that large pre-trained models can result in fragmented class representations, and we further explore model capabilities in a multi-corpora learning paradigm, toward constructing corpus-independent class representations. We utilize audio-only and joint language-audio representation learning models, including Wav2vec, VGGish, WavLM, and CLAP. Additionally, we explore the impact of class-agnostic data augmentation in this multi-corpora training setting. Our experiments reveal significant insights into the robustness of these embeddings for cross-dataset generalization. The findings underscore the importance of evaluating SER models beyond single-dataset performance to ensure applicability in real-world scenarios. Additionally, this comprehensive evaluation helps confirm whether the models are learning the intended features and behaviors, thereby enhancing their reliability and effectiveness.
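The cross-dataset evaluation described above is commonly run as a leave-one-corpus-out protocol: train on the pooled embeddings of all corpora but one, then test on the held-out corpus. The sketch below illustrates this idea only; the abstract does not specify the classifier or protocol, so the synthetic embeddings and the nearest-centroid classifier here are hypothetical stand-ins for the pre-trained representations (e.g., Wav2vec or CLAP features) and the actual SER head.

```python
import numpy as np

def leave_one_corpus_out(features_by_corpus, labels_by_corpus):
    """Leave-one-corpus-out evaluation sketch (assumed protocol).

    features_by_corpus: dict mapping corpus name -> (n_utterances, dim) array
                        of utterance-level embeddings.
    labels_by_corpus:   dict mapping corpus name -> (n_utterances,) array
                        of integer emotion labels.
    Returns a dict mapping each held-out corpus to classification accuracy.
    """
    accuracies = {}
    corpora = list(features_by_corpus)
    for held_out in corpora:
        # Multi-corpora training: pool embeddings from all remaining corpora.
        X_train = np.vstack([features_by_corpus[c] for c in corpora if c != held_out])
        y_train = np.concatenate([labels_by_corpus[c] for c in corpora if c != held_out])

        # Nearest-centroid classifier: one centroid per emotion class.
        classes = np.unique(y_train)
        centroids = np.stack([X_train[y_train == k].mean(axis=0) for k in classes])

        # Test on the unseen corpus: assign each utterance to its nearest centroid.
        X_test = features_by_corpus[held_out]
        y_test = labels_by_corpus[held_out]
        dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=-1)
        preds = classes[np.argmin(dists, axis=1)]
        accuracies[held_out] = float((preds == y_test).mean())
    return accuracies
```

If the class clusters transfer across corpora, held-out accuracy stays high; fragmented, corpus-dependent representations show up as a large gap between within-corpus and held-out accuracy.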