Consistent Weighted Sampling

  • Mark Manasse ,
  • Frank McSherry ,
  • Kunal Talwar

MSR-TR-2010-73 |

ABSTRACT
We describe an efficient procedure for sampling representatives from a weighted set such that for any weightings S and T, the probability that the two choose the same sample is equal to the Jaccard similarity between them: Pr[sample(S) = sample(T)] = sumx min(S(x), T(x)) / sumx max(S(x), T(x)) where sample(S) = (x, y) with 0 < y<= S(x). The sampling process takes expected time linear in the number of non-zero weights in S, independent of the weights themselves. We discuss and develop the implementation of our sampling schemes, reducing the requisite computation and randomness substantially. We additionally discuss a variety of settings in which weighted sampling with highly divergent weights is useful, for example TF-IDF style weighting for determining similarity of web pages, and privacy-preserving auctions.