From high-level deep neural models to FPGAs
- Hardik Sharma,
- Jongse Park,
- Divya Mahajan,
- Emmanuel Amaro,
- Joon Kyung Kim,
- Chenkai Shao,
- Asit Mishra,
- Hadi Esmaeilzadeh
2016 International Symposium on Microarchitecture | Published by IEEE Press
Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of domains. FPGAs are an attractive choice for DNNs since they offer a programmable substrate for acceleration and are becoming available across different market segments. However, obtaining both performance and energy efficiency with FPGAs is a laborious task even for expert hardware designers. Furthermore, the large memory footprint of DNNs, coupled with the limited on-chip storage of FPGAs, makes DNN acceleration using FPGAs more challenging. This work tackles these challenges by devising DnnWeaver, a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level specification in Caffe [1]. To achieve large benefits while preserving automation, DnnWeaver generates accelerators using hand-optimized design templates. First, DnnWeaver translates a given high-level DNN specification to its novel ISA, which represents a macro dataflow graph of the DNN. The DnnWeaver compiler is equipped with our optimization algorithm that tiles, schedules, and batches DNN operations to maximize data reuse and best utilize the target FPGA’s memory and other resources. The final result is a custom synthesizable accelerator that best matches the needs of the DNN while providing high performance and efficiency gains for the target FPGA. We use DnnWeaver to generate accelerators for a set of eight different DNN models and three different FPGAs: Xilinx Zynq, Altera Stratix V, and Altera Arria 10. We use hardware measurements to compare the generated accelerators to both multicore CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650Ti, and Tesla K40). In comparison, the generated accelerators deliver superior performance and efficiency without requiring the programmers to participate in the arduous task of hardware design.
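The abstract's central compiler step is a tiling search constrained by the FPGA's on-chip storage: operations are tiled so that the working set of each layer fits in BRAM and data reuse is maximized. The sketch below illustrates that kind of search for a single convolution layer. The cost model, function names, and parameters are illustrative assumptions for exposition, not the paper's actual algorithm.

```python
# Illustrative sketch (not DnnWeaver's real optimizer): choose the largest
# output tile for a k x k, stride-1 convolution such that the input tile,
# the layer's weights, and the output tile together fit in an on-chip
# buffer budget. All names and the byte-cost model are assumptions.

def bytes_needed(tile_h, tile_w, in_ch, out_ch, k, elem_bytes=2):
    """On-chip bytes for one output tile's working set."""
    in_tile = (tile_h + k - 1) * (tile_w + k - 1) * in_ch * elem_bytes
    weights = k * k * in_ch * out_ch * elem_bytes
    out_tile = tile_h * tile_w * out_ch * elem_bytes
    return in_tile + weights + out_tile

def pick_tile(out_h, out_w, in_ch, out_ch, k, bram_bytes):
    """Largest tile (by output area) whose working set fits in bram_bytes."""
    best = None
    for th in range(1, out_h + 1):
        for tw in range(1, out_w + 1):
            if bytes_needed(th, tw, in_ch, out_ch, k) <= bram_bytes:
                if best is None or th * tw > best[0] * best[1]:
                    best = (th, tw)
    return best

# Example: 3x3 conv, 64 -> 64 channels, 56x56 output, 512 KiB of BRAM.
# The full 56x56 output does not fit, so a smaller tile is chosen.
print(pick_tile(56, 56, 64, 64, 3, 512 * 1024))
```

A real compiler would extend this with scheduling and batching decisions across layers, trading off which operand (inputs, weights, or partial sums) stays resident on chip, which is the data-reuse optimization the abstract refers to.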