We consider the estimation of arbitrary range partitionings of data values and the ranking of frequently occurring items based on random sampling, using a small number of samples and achieving prescribed accuracy. These problems arise in the context of parallel processing of massive datasets, e.g., as performed in data centers of Internet-scale cloud services and in large-scale scientific computations. Range partitioning is a basic building block of parallel-processing systems based on the map-reduce paradigm.
For range partitioning, we consider a direct estimation method based on constructing an arbitrary-height histogram and characterize its estimation error. This approach provides substantial savings in constructing unbalanced range partitionings compared with a standard approach based on equi-height histograms; our results extend previous work that was restricted to equi-height histograms. For the problem of ranking frequently occurring items, we use a lumping of small-frequency items that enables us to obtain tighter bounds, independent of the total number of distinct items in a dataset. The analysis deploys the framework of large deviations, which is well suited to the typically large scale of data in the considered applications.
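To make the range-partitioning task concrete, the following is a minimal illustrative sketch (not the paper's exact estimator): partition boundaries for prescribed, possibly unequal, range weights are estimated as the corresponding quantiles of a random sample. All function and variable names here are hypothetical.

```python
import random

def estimate_boundaries(sample, weights):
    """Estimate range-partition boundaries from a random sample.

    `weights` gives the desired fraction of data falling in each
    range (summing to 1), so unequal weights yield an unbalanced
    partitioning; each boundary is the sample quantile at the
    cumulative weight. Illustrative sketch only.
    """
    s = sorted(sample)
    n = len(s)
    boundaries = []
    cum = 0.0
    for w in weights[:-1]:  # the last range extends to the maximum
        cum += w
        idx = min(int(cum * n), n - 1)
        boundaries.append(s[idx])
    return boundaries

# Example: sample from a synthetic uniform dataset and estimate
# boundaries for an unbalanced 3-way partition (50% / 30% / 20%).
random.seed(0)
data = [random.random() for _ in range(1_000_000)]
sample = random.sample(data, 10_000)
b = estimate_boundaries(sample, [0.5, 0.3, 0.2])
```

For uniform data the true boundaries are 0.5 and 0.8; with 10,000 samples the estimates concentrate near these values, and large-deviations bounds of the kind analyzed in the paper quantify how fast the error probability decays with the sample size.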
We demonstrate the tightness and benefits of our sampling methods using a large dataset from an operational cloud service, involving data at a scale of hundreds of billions of records. Our results provide insights into, and inform the design of, practical sampling methods.