“Uh, This One?”: Leveraging Behavioral Signals for Detecting Confusion during Physical Tasks

Proceedings of the 26th International Conference on Multimodal Interaction


A longstanding goal in the AI and HCI research communities is building intelligent assistants that help people with physical tasks. To be effective, AI assistants must be aware not only of the physical environment but also of the human user and their cognitive states. In this paper, we specifically consider the detection of confusion, which we operationalize as the moments when a user is “stuck” and needs assistance. We explore how behavioral features such as gaze, head pose, and hand movements differ between periods of confusion and non-confusion. We present various modeling approaches for detecting confusion that combine behavioral features, length of time, instructional text embeddings, and egocentric video. Although deep networks (e.g., V-JEPA) trained on full video streams perform well in distinguishing confusion from non-confusion, simpler models leveraging lighter-weight behavioral features exhibit similarly high performance, even when generalizing to unseen tasks.
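
The abstract's lighter-weight approach might look roughly like the sketch below: a simple classifier over per-window behavioral summaries, evaluated with task-held-out folds to mimic generalization to unseen tasks. The feature names, windowing, synthetic data, and choice of logistic regression are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch only: feature set, windowing, and classifier choice
# are assumptions, not the exact method described in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical per-window behavioral features: e.g., gaze dispersion,
# fixation rate, head-pose variance, hand speed, and window length.
n_windows, n_features = 500, 5
X = rng.normal(size=(n_windows, n_features))
y = rng.integers(0, 2, size=n_windows)        # 1 = confusion, 0 = non-confusion
tasks = rng.integers(0, 8, size=n_windows)    # task ID for each window

# Lightweight baseline: standardized features + logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Grouping folds by task approximates evaluation on unseen tasks.
scores = cross_val_score(clf, X, y, groups=tasks,
                         cv=GroupKFold(n_splits=4), scoring="roc_auc")
print(f"Task-held-out AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
```

On real data, the features would be computed from gaze, head pose, and hand tracking streams rather than sampled at random, and the labels would come from annotated moments of being “stuck.”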