This work studies the feasibility of using visual information to automatically measure the engagement level of TV viewers. Previous studies typically rely on expensive and invasive devices (e.g., eye trackers or physiological sensors) in controlled settings. Our work differs by using only an RGB video camera in a naturalistic setting, where viewers move freely and respond naturally and spontaneously. In particular, we recorded 47 people as they watched a TV program and manually coded the engagement level of each viewer. From each video, we extracted several features characterizing facial and head gestures, and applied several aggregation methods over a short time window to capture the temporal dynamics of engagement. We report classification results using the proposed features and show improved performance over baseline methods that rely mostly on head-pose orientation.
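To make the idea of aggregating per-frame features over a short time window concrete, the following is a minimal sketch, not the method used in this work. The function name `aggregate_windows`, the choice of statistics (mean, standard deviation, minimum, maximum), and the window and step sizes are all illustrative assumptions; the abstract does not specify which aggregation methods or window lengths were used.

```python
import numpy as np


def aggregate_windows(features, window, step):
    """Summarize per-frame features over sliding windows.

    features: array of shape (n_frames, n_features), e.g. per-frame
              head-pose angles or facial-gesture scores (hypothetical inputs).
    window:   number of frames per window (assumed value, not from the paper).
    step:     number of frames between consecutive window starts.

    Returns an array of shape (n_windows, 4 * n_features) holding the
    mean, standard deviation, minimum, and maximum of each feature.
    """
    features = np.asarray(features, dtype=float)
    n_frames, n_features = features.shape
    rows = []
    for start in range(0, n_frames - window + 1, step):
        chunk = features[start:start + window]
        rows.append(np.concatenate([
            chunk.mean(axis=0),  # average level within the window
            chunk.std(axis=0),   # variability, a rough proxy for movement
            chunk.min(axis=0),
            chunk.max(axis=0),
        ]))
    if not rows:
        # Fewer frames than one full window: no aggregated rows.
        return np.empty((0, 4 * n_features))
    return np.vstack(rows)


# Toy input: 10 frames with 2 features each.
frames = np.arange(20).reshape(10, 2)
out = aggregate_windows(frames, window=4, step=2)
print(out.shape)  # (4, 8)
```

Each output row could then be paired with the manually coded engagement label for the same time span and passed to a classifier; how windows are aligned with labels is left open here.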