Multimodal Processing of Human Behavior in Intelligent Instrumented Spaces: A Focus on Expressive Human Communication

  • Carlos Busso | USC Viterbi School of Engineering

Advances in technologies for capturing and processing multimedia signals are creating new opportunities for understanding and modeling human behavior, and for designing new human-centered applications. Intelligent environments equipped with an array of audio-visual sensors make it possible to automatically monitor and track the behavior, strategies, and engagement of the participants in multiperson interactions such as meetings. We describe a case study of a “Smartroom” being developed at USC, in which high-level features are computed from active-speaker segmentations, automatically annotated by our system, to infer the interaction dynamics among the participants. The results show that it is possible to accurately estimate in real time not only the flow of the interaction, but also how dominant and engaged each participant was during the discussion.

We also describe the analyses of expressive human behavior that such audio-visual data afford. In particular, we present an analysis of the interrelation between facial gestures and speech using a multimodal approach: in a controlled setting, motion capture technology was used to simultaneously acquire speech and detailed facial information. Our results indicate that the verbal and nonverbal channels of human communication are internally and intricately connected. This interplay is observed across communication channels, including various aspects of speech, facial expressions, and movements of the hands, head, and body, and it is greatly affected by the linguistic and emotional content of the message being communicated. Building on these analyses, we present applications in automatic emotion recognition and in the synthesis of expressive communication.
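To make the Smartroom idea concrete, the following is a minimal sketch of how high-level interaction features might be derived from an automatic active-speaker segmentation. The particular features used here (speaking-time fraction and turn share) are common proxies for dominance in meeting-analysis work; the actual features and models used in the system described above may differ.

```python
# Hypothetical sketch: scoring participant dominance from an active-speaker
# segmentation. Each segment is (speaker_id, start_sec, end_sec), as produced
# by an automatic speaker-localization/segmentation front end (assumed here).
from collections import defaultdict

def dominance_scores(segments):
    """Return a per-speaker dominance score in [0, 1] that blends
    speaking-time fraction with share of speaking turns."""
    talk_time = defaultdict(float)
    turns = defaultdict(int)
    for spk, start, end in segments:
        talk_time[spk] += end - start
        turns[spk] += 1
    total_time = sum(talk_time.values()) or 1.0
    total_turns = sum(turns.values()) or 1
    # Equal weighting of the two cues is an illustrative choice only.
    return {spk: 0.5 * talk_time[spk] / total_time
                 + 0.5 * turns[spk] / total_turns
            for spk in talk_time}

segs = [("A", 0, 30), ("B", 30, 35), ("A", 35, 60), ("C", 60, 70)]
print(dominance_scores(segs))  # speaker A dominates this toy meeting
```

Because the segmentation arrives incrementally, the same accumulators can be updated segment by segment, which is what makes a real-time estimate of the interaction flow feasible.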
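The audio-visual coupling analysis can likewise be illustrated with a toy example. The sketch below correlates a prosodic stream (frame-level pitch) with a facial motion-capture trajectory (eyebrow-marker height); both signals and the use of a plain Pearson correlation are illustrative assumptions, not the study's actual features or statistics.

```python
# Hypothetical sketch: measuring coupling between a speech stream and a
# facial-marker stream with Pearson correlation over frame-synchronous data.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy frame-synchronous streams: pitch (Hz) and eyebrow-marker height (mm).
pitch   = [110, 115, 130, 150, 145, 120, 112, 118]
eyebrow = [1.0, 1.2, 1.8, 2.5, 2.3, 1.4, 1.1, 1.3]
print(pearson(pitch, eyebrow))  # strong positive coupling in this toy data
```

In practice the analysis would be run per emotional or linguistic condition, so that changes in the strength of the coupling can be attributed to the content of the message.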

[This research was supported in part by funds from the NSF, NIH, and the Department of the Army]

Speaker Details

Carlos Busso received his B.S. (2000) and M.S. (2003) degrees with high honors in electrical engineering from the University of Chile, Santiago, Chile. He is currently a Ph.D. candidate in electrical engineering at the University of Southern California (USC), Los Angeles, USA. Since 2003, he has been a research assistant in the Speech Analysis and Interpretation Laboratory (SAIL) at USC. He was selected by the School of Engineering of Chile as the best electrical engineer to graduate in Chile in 2003. At USC, he received a Provost Doctoral Fellowship from 2003 to 2005 and a Fellowship in Digital Scholarship from 2007 to 2008. His research interests are in digital signal processing, speech and video processing, and multimodal interfaces. His current research focuses on modeling and understanding human communication and interaction, with applications to automated recognition and synthesis for enhancing human-machine interfaces. He has worked on audio-visual emotion recognition, analysis of emotional modulation in gestures and speech, the design of realistic human-like virtual characters, speech source detection using microphone arrays, speaker localization and identification in intelligent environments, and sensing human interaction in multiperson meetings.