GRNN: Low-Latency and Scalable RNN Inference on GPUs

Connor Holmes; Daniel Mawhirter; Yuxiong He; Feng Yan; Bo Wu

EuroSys | September 2019

Recurrent neural networks (RNNs) have gained significant attention due to their effectiveness in modeling sequential data, such as text and voice signals. However, because of their complex data dependencies and limited parallelism, current RNN inference libraries for GPUs suffer from either high latency or poor scalability, leading to inefficient resource utilization. Consequently, companies like Microsoft and Facebook use CPUs to serve RNN models.

This work demonstrates the root causes of the unsatisfactory performance of existing GPU implementations of RNN inference, including poor data reuse, low on-chip resource utilization, and high synchronization overhead. We systematically address these issues and develop a GPU-based RNN inference library, called GRNN, that provides low latency, high throughput, and efficient resource utilization. GRNN minimizes global memory accesses and synchronization overhead and balances on-chip resource usage through novel data reorganization, thread mapping, and performance modeling techniques. In evaluations on extensive benchmarks and real-world applications, GRNN outperforms the state-of-the-art CPU inference library by up to 17.5X and state-of-the-art GPU inference libraries by up to 9X in terms of latency reduction.
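
To make the data dependency and weight reuse mentioned above concrete, the following is a minimal sketch of a generic vanilla RNN forward pass in plain Python. It is not GRNN's implementation and does not reflect its data layout or thread mapping; the function names `rnn_step` and `rnn_forward` and the tanh cell are illustrative assumptions. It only shows that each hidden state depends on the previous one, so time steps cannot run in parallel, while the same weight matrices are read again at every step.

```python
import math

def rnn_step(W, U, b, x_t, h_prev):
    """One vanilla RNN step: h_t = tanh(W x_t + U h_prev + b).

    Hypothetical helper for illustration only (not part of GRNN).
    W is hidden x input, U is hidden x hidden, b has length hidden.
    """
    hidden = len(b)
    h_t = []
    for i in range(hidden):
        acc = b[i]
        acc += sum(W[i][j] * x_t[j] for j in range(len(x_t)))
        acc += sum(U[i][k] * h_prev[k] for k in range(hidden))
        h_t.append(math.tanh(acc))
    return h_t

def rnn_forward(W, U, b, xs, h0):
    """Run the cell over a sequence and return every hidden state.

    The loop is inherently sequential: step t needs the output of
    step t-1. The weights W and U are identical at every step, which
    is where data reuse becomes possible.
    """
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(W, U, b, x_t, h)
        states.append(h)
    return states
```

In this sketch, parallelism is available only inside a single step (across the rows of `W` and `U`), which hints at why keeping the reused weights close to the compute units and limiting per-step synchronization matter for latency.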