What information does the clustering of a finite data set reveal about the underlying distribution from which the data were sampled? This basic question has proved elusive even for the most widely used clustering procedures. One natural criterion is to seek clusters that converge (as the data set grows) to regions of high density. When all possible density levels are considered, this becomes a hierarchical clustering problem in which the sought limit is called the “cluster tree”.
This talk will describe two simple algorithms for estimating this tree that implicitly construct a multiscale hierarchy of near-neighbor graphs on the data points. We’ll show that these procedures are consistent, and give rates of convergence using a percolation argument that also yields insight into how neighborhood graphs should be constructed.
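To make the idea of a multiscale hierarchy of near-neighbor graphs concrete, here is a minimal sketch (not the talk's exact procedure, and with all function names my own): sweep a radius r, link any two data points within distance r of each other, and read off the connected components of that graph as the clusters at scale r. Because the r-neighbor graphs are nested as r grows, the components at a smaller radius nest inside those at a larger one, giving a hierarchy.

```python
# Hedged sketch of a single-linkage-style cluster tree estimate:
# at each radius r, take connected components of the graph that
# links every pair of points within distance r.
from math import dist

def components_at_radius(points, r):
    """Connected components of the r-neighbor graph, via union-find."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of points within distance r.
    for i in range(n):
        for j in range(i + 1, n):
            if dist(points[i], points[j]) <= r:
                parent[find(i)] = find(j)

    # Group indices by their component root.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def cluster_tree(points, radii):
    """Clusters at each scale; components nest as the radius grows."""
    return {r: components_at_radius(points, r) for r in sorted(radii)}

# Two well-separated pairs: two clusters at a small radius,
# one cluster once the radius spans the gap.
data = [(0.0, 0.0), (0.2, 0.0), (5.0, 0.0), (5.2, 0.0)]
tree = cluster_tree(data, [0.3, 10.0])
print(tree[0.3])   # [[0, 1], [2, 3]]
print(tree[10.0])  # [[0, 1, 2, 3]]
```

The brute-force pairwise scan is O(n^2) per radius; it is only meant to illustrate the graph-and-components structure, not the efficiency or the robustness-to-noise refinements an actual consistent estimator would need.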