Abstract

Early recognition and prediction of human activities are of
great importance in video surveillance: for example, recognizing a
criminal activity at its beginning stage makes it possible to avoid
unfortunate outcomes. We address early activity recognition
by developing a Spatial-Temporal Implicit Shape Model
(STISM), which characterizes the space-time structure of
the sparse local features extracted from a video. The early
recognition of human activities is accomplished by pattern
matching through STISM. To enable efficient and robust
matching, we propose a new random forest structure,
called the multi-class balanced random forest, which strikes a
good trade-off between the balance of the trees and their
discriminative ability. Prediction is performed simultaneously
for multiple classes, which reduces both memory and
computational cost. Experiments show that our algorithm
significantly outperforms the state of the art on the human
activity prediction problem.