Abstract

Multiclass action detection in complex scenes is a challenging problem because of cluttered backgrounds and the large intra-class variations in each type of actions. To achieve efficient and robust action detection, we characterize a video as a collection of spatio-temporal interest points, and locate actions via finding spatio-temporal video subvolumes of the highest mutual information score towards each action class. A random forest is constructed to efficiently generate discriminative votes from individual interest points, and a fast top-K subvolume search algorithm is developed to find all action instances in a single round of search. Without significantly degrading the performance, such atop-K search can be performed on down-sampled score volumes for more efficient localization. Experiments on a challenging MSR Action Dataset II validate the effectiveness of our proposed multiclass action detection method. The detection speed is several orders of magnitude faster than existing methods.