MacroSS: Macro-SIMDization of Streaming Applications

  • Amir Hormati,
  • Yoonseo Choi,
  • Mark Woh,
  • Manjunath Kudlur,
  • Rodric Rabbah,
  • Trevor Mudge,
  • Scott Mahlke

Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2010)

SIMD (Single Instruction, Multiple Data) engines are an essential part of processors in computing markets ranging from servers to the embedded domain. Although SIMD-enabled architectures can boost the performance of many application domains by exploiting data-level parallelism, it is very challenging for both compilers and programmers to identify and transform the parts of a program that will benefit from a particular SIMD engine. The focus of this paper is the problem of SIMDization for the growing application domain of streaming. Streaming applications are an ideal match for a wide range of architectures, from shared/distributed-memory multi-core systems and tiled architectures to single-core systems. Since these architectures, in most cases, also provide SIMD acceleration units, it is highly beneficial to generate SIMD code from streaming programs. Specifically, we introduce MacroSS, which performs macro-SIMDization on high-level streaming graphs. Macro-SIMDization uses high-level information, such as the execution rates of actors and the communication patterns between them, to transform the graph structure, vectorize the actors of a streaming program, and generate intermediate code. We also propose low-overhead architectural modifications that accelerate the shuffling of data elements between the scalar and vectorized parts of a streaming program. Our experiments show that MacroSS generates code that, on average, outperforms scalar code compiled with current state-of-the-art auto-vectorizing compilers by 54%. Using the low-overhead data-shuffling hardware, performance improves by an additional 8% with less than 1% area overhead.
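
To give a flavor of what vectorizing a streaming actor means, the sketch below shows a scalar work function next to a version that processes a full SIMD-width batch of firings in one invocation, which is the essence of macro-SIMDization at the actor level. This is only an illustrative sketch using SSE intrinsics and invented names (scale_work, scale_work_vec, SIMD_WIDTH); it is not the code MacroSS itself produces, and the actual system additionally restructures the stream graph and inserts scalar-to-vector data shuffles.

```c
#include <xmmintrin.h>   /* SSE intrinsics; assumes an x86 target */

#define SIMD_WIDTH 4     /* hypothetical vector width: 4 x 32-bit floats */

/* Scalar actor firing: pop one element, push one scaled element. */
static void scale_work(const float *in, float *out, float gain)
{
    out[0] = in[0] * gain;
}

/* Macro-SIMDized actor firing: one invocation covers SIMD_WIDTH
 * consecutive firings, so the actor's execution rate in the stream
 * graph is effectively divided by the vector width. */
static void scale_work_vec(const float *in, float *out, float gain)
{
    __m128 g = _mm_set1_ps(gain);            /* broadcast the gain        */
    __m128 v = _mm_loadu_ps(in);             /* pop 4 input elements      */
    _mm_storeu_ps(out, _mm_mul_ps(v, g));    /* scale and push 4 outputs  */
}
```

In this view, the high-level graph information the abstract mentions (execution rates and communication patterns) determines which actors can be fired in such batches and where data must be shuffled between the scalar and vectorized regions of the program.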