Consistent Weighted Sampling

Mark Manasse; Frank McSherry; Kunal Talwar

Consistent Weighted Sampling

Mark Manasse ,
Frank McSherry ,
Kunal Talwar

MSR-TR-2010-73 | June 2010

Download BibTex

ABSTRACT
We describe an efficient procedure for sampling representatives from a weighted set such that for any weightings S and T, the probability that the two choose the same sample is equal to the Jaccard similarity between them: Pr[sample(S) = sample(T)] = sum_x min(S(x), T(x)) / sum_x max(S(x), T(x)) where sample(S) = (x, y) with 0 < y<= S(x). The sampling process takes expected time linear in the number of non-zero weights in S, independent of the weights themselves. We discuss and develop the implementation of our sampling schemes, reducing the requisite computation and randomness substantially. We additionally discuss a variety of settings in which weighted sampling with highly divergent weights is useful, for example TF-IDF style weighting for determining similarity of web pages, and privacy-preserving auctions.