Stencil Computation Auto-tuning on Modern Multicore Architectures

May 22, 2009
Kaushik Datta | U.C. Berkeley

How to design and use emerging multicore systems most efficiently is one of the most challenging questions the mainstream and scientific computing industries have faced in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies and use them to build an auto-tuning environment that minimizes runtime while maximizing performance portability. Within this auto-tuning framework, we also employ techniques such as machine learning to find near-optimal parameter configurations quickly. To evaluate the effectiveness of these strategies, we explored the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications for scientific algorithm development.
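For readers unfamiliar with the terms, the following is a minimal sketch of what a stencil sweep and an auto-tuning search look like. It is not the speaker's framework: the function names (`make_grid`, `jacobi_sweep`, `autotune`), the 2D 5-point stencil, the single tuning parameter (a square cache-block size), and the exhaustive search are all assumptions chosen for illustration. The abstract mentions machine learning for the search; here a plain timed loop over candidates stands in for it. Pure Python timings say little about real hardware, so the sketch shows only the structure of the approach.

```python
import time


def make_grid(n, fill=0.0):
    """Return an n x n grid (list of lists) filled with a constant."""
    return [[fill] * n for _ in range(n)]


def jacobi_sweep(src, dst, block):
    """One 5-point nearest-neighbor sweep over the interior of src into dst.

    The interior is traversed in block x block tiles (cache blocking).
    Boundary cells of dst are left untouched.
    """
    n = len(src)
    for ii in range(1, n - 1, block):
        i_end = min(ii + block, n - 1)
        for jj in range(1, n - 1, block):
            j_end = min(jj + block, n - 1)
            for i in range(ii, i_end):
                up, mid, down, out = src[i - 1], src[i], src[i + 1], dst[i]
                for j in range(jj, j_end):
                    # Each output point averages itself and its four neighbors.
                    out[j] = 0.2 * (mid[j] + up[j] + down[j] + mid[j - 1] + mid[j + 1])


def autotune(n, candidates, repeats=3):
    """Time the sweep for each candidate block size and return the fastest.

    Exhaustive search is an illustrative stand-in; a real auto-tuner would
    explore a much larger parameter space with smarter search strategies.
    """
    src = make_grid(n, 1.0)
    dst = make_grid(n, 1.0)
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            jacobi_sweep(src, dst, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_block, timings = autotune(256, [8, 16, 32, 64])
    for block, seconds in sorted(timings.items()):
        print(f"block={block:3d}  best time={seconds:.4f}s")
    print("selected block size:", best_block)
```

The key property is that every block size produces the same numerical result, so the tuner is free to pick whichever configuration runs fastest on the machine at hand. That separation between correctness and tuned parameters is what makes performance portability possible.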

Speaker Details

Kaushik Datta is currently working on tuning stencil codes for his Ph.D. thesis at U.C. Berkeley. This work includes incorporating both algorithmic and architecture-specific optimizations into an auto-tuning framework to achieve performance portability. Previously, he wrote and tuned the NAS Parallel Benchmarks in the Titanium language. His interests include high-performance computing, auto-tuning, and parallel (multicore) programming.