This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, computational advertising, etc.
By storing only the lowest b bits of each hashed value (e.g., b = 1 or 2), we gain substantial advantages in terms of storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b = 1 may reduce the storage space by a factor of at least 21.3 (or 10.7) compared to b = 64 (or b = 32), if one is interested in resemblance > 0.5. Our theoretical results are validated using a proprietary collection of 10^6 news articles and a public dataset of 300,000 articles.
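As a rough illustration of the idea, and not the paper's full estimator, the following Python sketch keeps the lowest b bits of each minwise hash and estimates the resemblance R from the fraction of matching b-bit values. It assumes the sparse-set limit (both sets tiny relative to the universe), in which the match probability is taken to be P_b = 2^-b + (1 - 2^-b) R; the general estimator depends on the relative set sizes, which this sketch ignores. The function names, the use of salted BLAKE2b as a stand-in for random permutations, and the helper `storage_ratio_sparse` are all hypothetical choices made for this example.

```python
import hashlib


def _hash64(j, x):
    """64-bit hash of item x under the j-th (salted) hash function.
    Salted BLAKE2b is an assumed stand-in for a random permutation."""
    digest = hashlib.blake2b(f"{j}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bbit_signature(items, k, b):
    """For each of k hash functions, keep the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    return [min(_hash64(j, x) for x in items) & mask for j in range(k)]


def estimate_resemblance(sig_a, sig_b, b):
    """Sparse-limit estimator: invert P_b = c + (1 - c) * R with c = 2**-b."""
    c = 2.0 ** -b
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    p_hat = matches / len(sig_a)
    return (p_hat - c) / (1.0 - c)


def storage_ratio_sparse(R, b, full_bits=64):
    """Storage needed by full minwise hashing divided by storage needed by
    b-bit hashing for equal estimator variance (sparse-limit assumption)."""
    c = 2.0 ** -b
    p = c + (1.0 - c) * R
    var_full = R * (1.0 - R)                 # per-sample variance, full hashes
    var_b = p * (1.0 - p) / (1.0 - c) ** 2   # per-sample variance, b-bit hashes
    return (full_bits * var_full) / (b * var_b)


if __name__ == "__main__":
    A = set(range(0, 300))
    B = set(range(100, 400))   # true resemblance is 200 / 400 = 0.5
    sig_a = bbit_signature(A, k=1000, b=1)
    sig_b = bbit_signature(B, k=1000, b=1)
    print("estimated R:", estimate_resemblance(sig_a, sig_b, b=1))
    print("ratio vs 64 bits:", storage_ratio_sparse(0.5, b=1, full_bits=64))
    print("ratio vs 32 bits:", storage_ratio_sparse(0.5, b=1, full_bits=32))
```

Under these assumptions, at R = 0.5 and b = 1 the ratio works out to 64/3 (about 21.3) against 64-bit hashes and 32/3 (about 10.7) against 32-bit hashes, which is consistent with the factors quoted above. Note that a smaller b requires more hash functions k for the same accuracy; the ratio accounts for this by comparing total bits at equal variance.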