Partial-Duplicate Clustering and Visual Pattern Discovery on Web Scale Image Database
- Wei Li ,
- Changhu Wang ,
- Lei Zhang ,
- Yong Rui ,
- Bo Zhang
IEEE Transactions on Multimedia (TMM) |
In this paper, we study the problem of
discovering visual patterns and partial-duplicate images, which is
fundamental to visual concept representation and image parsing,
but very challenging when the database is extremely large, such
as billions of images indexed by a commercial search engine.
Although extensive research with sophisticated algorithms has
been conducted for either partial-duplicate clustering or visual
pattern discovery, most of them can not be easily extended to this
scale, since both are clustering problems in nature and require
pairwise comparisons. To tackle this computational challenge,
we introduce a novel and highly parallelizable framework
to discover partial-duplicate images and visual patterns in a
unified way in distributed computing systems. We emphasize the
nested property of local features, and propose the generalized
nested feature (GNF) as a mid-level representation for regions
and local patterns. Initial coarse clusters are then discovered
by GNFs, upon which -gram GNF is defined to represent
co-occurrent visual patterns. After that, efficient merging and
refining algorithms are used to get the partial-duplicate clusters,
and logical combinations of probabilistic GNF models are
leveraged to represent the visual patterns of partially duplicate
images. Extensive experiments show the parallelizable property
and effectiveness of the algorithms on both partial-duplicate
clustering and visual pattern discovery. With 2000 machines,
it costs about eight and 400 minutes to process one million and
40 million images respectively, which is quite efficient compared
to previous methods.