TQP++: Bridging ML Compilers and Analytical Query Processing on GPUs

The convergence of unified, cloud-native Lakehouse platforms such as Microsoft Fabric and the widespread deployment of AI-optimized hardware in datacenters is driving the rise of GPU-based analytical engines. However, building a GPU query engine that is both competitive with hand-tuned implementations and portable across vendors (NVIDIA, AMD, custom silicon) remains an open challenge: existing systems either lock into a single vendor’s toolchain or sacrifice performance for generality. In this paper, we present TQP++, an ML-compiler-native analytical query processor that repurposes ML compiler infrastructure to close this gap, achieving high performance and hardware portability from a single codebase. TQP++ integrates the Antares compilation framework with tiered GPU resource scheduling for SQL operators, a map-reduce-oriented fusion schema that eliminates intermediate materializations, and a multi-gated execution graph that adapts operator algorithms to runtime data characteristics. On TPC-H SF100, TQP++ executes all 22 queries in 1.1 seconds on an A100 (7x faster than HeavyDB, 15x faster than CPU baselines) and in under 0.7 seconds on H100 and MI300, while targeting 9 devices across 3 vendors, including Xbox, without code changes. To our knowledge, this is the most hardware-diverse GPU query processor reported to date, and it achieves sub-second TPC-H SF100 execution on a single GPU.