Towards Understanding the Performance of FPGAs using OpenCL Benchmarks

Hamid Reza Zohouri; Naoya Maruyama; Aaron Smith; Motohiko Matsuda; Satoshi Matsuoka

HiPEAC Workshop on Reconfigurable Computing | March 2016

We evaluate a subset of the benchmarks in the Rodinia suite using Altera’s OpenCL SDK on a Terasic DE5-Net board equipped with an Altera Stratix V 5SGXA7 FPGA. We present timing and power estimation results and compare them with those of a modern CPU and GPU. Results are reported for multiple versions of each benchmark, each optimized for FPGAs to a different degree, ranging from direct ports of the original OpenCL implementation to loop-pipelined kernels written specifically for FPGAs. Our results show that, although FPGAs can be programmed in the same language as other, more widely used HPC accelerators, the implementation style that is optimal for FPGAs differs significantly from the one that suits accelerators such as GPUs. Specifically, we find that the multithreaded kernels typically used on GPUs perform less efficiently than kernels that apply FPGA-specific optimizations such as sliding windows. With these optimizations, however, FPGAs programmed in OpenCL show promising performance: on the Stratix V 5SGXA7, it is possible to achieve up to 3.9x better power efficiency than an Nvidia K20C GPU.
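To make the contrast between the two kernel styles concrete, the following is a minimal sketch in plain C, not the authors' code and not taken from the Rodinia benchmarks. It assumes a hypothetical 1D stencil of radius 1. The function names `stencil_direct` and `stencil_sliding_window` and the `RADIUS` and `WINDOW` constants are illustrative assumptions. The first function re-reads every neighbor from memory for each output, much as a one-thread-per-element GPU kernel would. The second reads each input once and keeps the last few values in a small window that shifts by one element per iteration, which is the kind of pattern an FPGA OpenCL compiler may be able to map to a shift register inside a pipelined loop.

```c
#include <stddef.h>

#define RADIUS 1                 /* hypothetical stencil radius */
#define WINDOW (2 * RADIUS + 1)  /* number of inputs per output */

/* Direct form: each output re-reads all of its neighbors from memory.
 * Boundary elements of out are left untouched. */
void stencil_direct(const float *in, float *out, size_t n)
{
    for (size_t i = RADIUS; i + RADIUS < n; ++i) {
        float sum = 0.0f;
        for (size_t k = 0; k < WINDOW; ++k)
            sum += in[i - RADIUS + k];
        out[i] = sum;
    }
}

/* Sliding-window form: one loop, each input is read exactly once and
 * shifted into a small fixed-size window that holds the neighbors.
 * Boundary elements of out are left untouched. */
void stencil_sliding_window(const float *in, float *out, size_t n)
{
    float window[WINDOW] = {0.0f};

    for (size_t i = 0; i < n; ++i) {
        /* Shift the window left by one element. */
        for (size_t k = 0; k + 1 < WINDOW; ++k)
            window[k] = window[k + 1];
        /* Load the only new value needed this iteration. */
        window[WINDOW - 1] = in[i];

        /* The window is full once 2 * RADIUS + 1 inputs have been seen;
         * it is then centered on element i - RADIUS. */
        if (i >= 2 * RADIUS) {
            float sum = 0.0f;
            for (size_t k = 0; k < WINDOW; ++k)
                sum += window[k];
            out[i - RADIUS] = sum;
        }
    }
}
```

Both functions produce the same interior outputs. The difference lies only in the memory access pattern: the sliding-window version issues one external read per iteration regardless of the stencil width, at the cost of a small on-chip buffer.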