This project is focused on optimizing the resource usage of bigData jobs. Standard database query optimization focuses on finding the best query execution plan, given fixed hardware resources. In BigData settings, both pay-as-you-go clouds and on-prem shared clusters, a complementary challenge emerges: Resource Optimization: find the best hardware resources, given an execution plan. In this world, provisioning is almost instantaneous and time-varying resources can be acquired on a per-query basis. This allows us to optimize allocations for completion time, resource usage, dollar cost, etc. These optimizations have a huge impact on performance and cost.
One of the biggest challenges in doing resource optimization is to understand the impact of a change in resource allocation on the performance of a BigData job. This task is challenging for users and tools alike due to lack of good statistics (high-velocity, unstructured data), frequent use of UDFs, impact on performance of different hardware types and a lack of understanding of parallel execution at such a scale.
We address this in our system with PerfOrator, a novel approach to resource-to-performance modeling. PerfOrator employs nonlinear regression on profile runs to model arbitrary UDFs, calibration queries to generalize across hardware platforms, and analytical framework models to account for parallelism. The resulting estimates are orders of magnitude more accurate than existing approaches (e.g, Hive’s optimizer), and have been successfully employed in two resource optimization scenarios: 1) optimize provisioning of clusters in cloud settings—with decisions within 1% of optimal, 2) reserve skyline of resources for SLA jobs—with accuracies over 10x better than human experts.