Learning to Predict Engagement with a Spoken Dialog System in Open-World Settings

We consider the challenge of predicting the engagement of people with an open-world dialog system, where one or more participants may establish, maintain, and break the communication frame. In particular, we show how a system can learn to predict a person's intention to engage from multiple observations extracted through visual analysis of people who come into the system's proximity.
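
To make the learning problem concrete, the following is a minimal sketch of one way such a predictor could be set up. It is not the model described in this work: the abstract does not name a classifier or a feature set, so the logistic regression, the three features (distance to the system, a 0-to-1 "facing" score, and approach speed), and the toy data below are all assumptions chosen purely for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    # Estimated probability that the observed person intends to engage.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(samples, labels, lr=0.2, epochs=3000):
    # Batch gradient descent on the logistic (cross-entropy) loss.
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = predict_proba(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical visual observations (illustrative values, not real data):
# [distance in meters, facing score in 0..1, approach speed in m/s]
SAMPLES = [
    [1.0, 0.9, 0.6],   # close, facing the system, approaching
    [1.5, 0.8, 0.4],
    [0.8, 1.0, 0.2],
    [3.5, 0.1, 0.0],   # far away, looking elsewhere
    [4.0, 0.2, -0.3],  # far away, moving off
    [3.0, 0.0, 0.1],
]
LABELS = [1, 1, 1, 0, 0, 0]  # 1 = the person went on to engage

weights, bias = train(SAMPLES, LABELS)
p_engage = predict_proba(weights, bias, [1.2, 0.9, 0.5])
p_pass = predict_proba(weights, bias, [3.8, 0.1, -0.1])
print(f"approaching and facing: {p_engage:.2f}")
print(f"distant and turned away: {p_pass:.2f}")
```

In this sketch each observation is a single snapshot. A deployed system would more plausibly use features tracked over time and labels gathered from actual interactions, but those details are outside what the abstract states.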