Accelerating Advanced Analytics

March 14, 2016
Arun Kumar | University of Wisconsin-Madison

Advanced analytics, the analysis of large and complex data using machine learning (ML), is becoming ubiquitous, and demand for advanced analytics tools is growing across enterprise domains. However, several challenging bottlenecks remain in the end-to-end process of building and deploying advanced analytics applications. My research focuses on abstractions, algorithms, and systems that mitigate such bottlenecks and accelerate advanced analytics from a data management standpoint. In this talk, I will focus on my work on mitigating one pervasive bottleneck in feature engineering for ML: joins of multiple tables. Many real-world datasets are multi-table, connected by key-foreign key relationships, but almost all ML toolkits expect single-table inputs. This forces data scientists to join all tables and materialize a single table that collects all features. Alas, such joins often cause the output to blow up in size, which slows down ML, increases costs, and leads to data maintenance headaches. In my work, I show how it is possible to mitigate these issues by “avoiding joins physically,” i.e., pushing ML down through joins. This reduces runtime without affecting accuracy. Going further, I apply statistical learning theory to show how it is often possible to also “avoid joins logically,” i.e., ignore entire tables outright, achieving significant runtime gains without losing much accuracy.
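To make the idea of pushing computation through a join concrete, here is a minimal Python sketch. It is not the method presented in the talk. It only illustrates the general principle on a linear score, and the tables S and R, the weight vectors, and the function names are all hypothetical. The assumption is a fact table S whose rows carry a foreign key into a smaller table R. Because a linear score is a sum over features, the contribution of R's features can be computed once per R row and looked up by key, instead of being recomputed for every joined row.

```python
import math

# Hypothetical toy data. Each S row is (features of S, foreign key into R).
S = [
    ([1.0, 2.0], "a"),
    ([0.5, 1.5], "b"),
    ([2.0, 0.0], "a"),
    ([1.0, 1.0], "a"),
]
# R maps each key to the features stored in the referenced table.
R = {
    "a": [3.0, 1.0, 2.0],
    "b": [0.0, 4.0, 1.0],
}
# Hypothetical model weights, split by the table each feature comes from.
w_S = [0.1, -0.2]
w_R = [0.3, 0.0, -0.1]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def scores_materialized(S, R, w_S, w_R):
    """Join first, then score every wide row."""
    joined = [x_s + R[fk] for x_s, fk in S]  # R's features repeated per S row
    w = w_S + w_R
    return [dot(w, row) for row in joined]


def scores_factorized(S, R, w_S, w_R):
    """Score R's features once per R row, then look up by foreign key."""
    partial = {key: dot(w_R, x_r) for key, x_r in R.items()}
    return [dot(w_S, x_s) + partial[fk] for x_s, fk in S]


# Both strategies give the same scores up to floating-point rounding.
wide = scores_materialized(S, R, w_S, w_R)
pushed = scores_factorized(S, R, w_S, w_R)
assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(wide, pushed))
```

In this toy setting the materialized join stores R's three features once per S row, while the factorized version touches each R row only once. The saving grows with the number of S rows that reference the same R row.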