Models that captures the common structure of an object class have appeared few years ago in the literature (Jojic and Caspi in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 212–219, 2004; Winn and Jojic in Proceedings of International Conference on Computer Vision (ICCV), pp. 756–763, 2005); they are often referred as “stel models.” Their main characteristic is to segment objects in clear, often semantic, parts as a consequence of the modeling constraint which forces the regions belonging to a single segment to have a tight distribution over local measurements, such as color or texture. This self-similarity within a region in a single image is typical of many meaningful image parts, even when across different images of similar objects, the corresponding parts may not have similar local measurements. Moreover, the segmentation itself is expected to be consistent within a class, although still flexible. These models have been applied mostly to segmentation scenarios.

In this paper, we extent those ideas (1) proposing to capture correlations that exist in structural elements of an image class due to global effects, (2) exploiting the segmentations to capture feature co-occurrences and (3) allowing the use of multiple, eventually sparse, observation of different nature. In this way we obtain richer models more suitable to recognition tasks.

We accomplish these requirements using a novel approach we dubbed stel component analysis. Experimental results show the flexibility of the model as it can deal successfully with image/video segmentation and object recognition where, in particular, it can be used as an alternative of, or in conjunction with, bag-of-features and related classifiers, where stel inference provides a meaningful spatial partition of features.