Quickr: Lazily Approximating Complex Ad-Hoc Queries in Big Data Clusters

Srikanth Kandula; Anil Shanbhag; Aleksandar Vitorovic; Matthaios Olma; Robert Grandl; Surajit Chaudhuri; Bolin Ding

Quickr: Lazily Approximating Complex Ad-Hoc Queries in Big Data Clusters

Srikanth Kandula ,
Anil Shanbhag ,
Aleksandar Vitorovic ,
Matthaios Olma ,
Robert Grandl ,
Surajit Chaudhuri ,
Bolin Ding

Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2016) | June 2016

Published by ACM - Association for Computing Machinery

Download BibTex

We present a system that approximates the answer to complex ad-hoc queries in big-data clusters by injecting samplers on-the-fly and without requiring pre-existing samples. Improvements can be substantial when big-data queries take multiple passes over data and when samplers execute early in the query plan. We present a new universe sampler which is able to sample multiple join inputs. By incorporating samplers natively into a cost-based query optimizer, we automatically generate plans with appropriate samplers at appropriate locations. We devise an accuracy analysis method using which we ensure that query plans with samplers will not miss groups and that aggregate values are within a small ratio of their true value. An implementation on a cluster with tens of thousands of machines shows that queries in the TPC-DS benchmark use a median of 2x fewer resources. In contrast, approaches that construct input samples even when given 10x the size of the input to store samples improve only 22% of the queries, i.e. a median speed up of 0x.

SIGMOD talk slides.

Some errata in the conference version and proofs are here.

Ships with Azure Data Lake Analytics; reference documentation is here (opens in new tab)and there (opens in new tab).