Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning

  • Bin Lin,
  • Ningxin Zheng,
  • Lei Wang,
  • Shijie Cao,
  • Lingxiao Ma,
  • Quanlu Zhang,
  • Yi Zhu,
  • Ting Cao,
  • Jilong Xue

Sixth Conference on Machine Learning and Systems (MLSys'23)

N:M sparsity, in which at most N out of every M consecutive weights are nonzero, is becoming increasingly popular for its potential to deliver both high model accuracy and computational efficiency in deep learning. However, its real-world benefit has been limited by the lack of dedicated GPU kernel implementations for general N:M sparsity with various sparsity ratios. In this work, we introduce nmSPARSE, a library of efficient GPU kernels for two fundamental operations in neural networks with N:M-sparse weights: sparse matrix-vector multiplication (SpMV) and sparse matrix-matrix multiplication (SpMM). By exploiting the intrinsic balance of N:M sparsity, nmSPARSE kernels rearrange the irregular computation and scattered memory accesses of sparse matrix multiplication into hardware-aligned regular computation and conflict-free memory accesses at runtime. Evaluated on an NVIDIA A100 GPU, nmSPARSE kernels achieve up to 5.2× speedup on SpMV and 6.0× speedup on SpMM over the fastest baseline. End-to-end studies on transformer models demonstrate that nmSPARSE outperforms other baselines.
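To make the storage format concrete, the sketch below shows a naive CUDA SpMV over a weight matrix kept in compressed N:M form: each group of M consecutive elements in a row stores exactly N values plus their intra-group offsets. This is a minimal illustration under assumed names and layout (the kernel `nm_spmv` and the `vals`/`idx` arrays are hypothetical), not the nmSPARSE API or its optimized kernels.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

constexpr int N = 2;   // nonzeros kept per group of M
constexpr int M = 4;   // group size, i.e. 2:4 sparsity

// One thread per output row. Every row stores exactly (cols/M)*N values,
// so all threads run the same trip count: this is the balance property
// of N:M sparsity that the abstract says the kernels exploit.
__global__ void nm_spmv(const float* vals, const int* idx,
                        const float* x, float* y, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    int groups = cols / M;
    float acc = 0.0f;
    for (int g = 0; g < groups; ++g)
        for (int n = 0; n < N; ++n) {
            int k = (row * groups + g) * N + n;  // compressed offset
            acc += vals[k] * x[g * M + idx[k]];  // idx[k] is in [0, M)
        }
    y[row] = acc;
}

int main() {
    const int rows = 8, cols = 16, groups = cols / M;
    // Toy 2:4-sparse weights in compressed form: keep offsets 0 and 2
    // of every group, each stored value equal to row + 1.
    std::vector<float> vals(rows * groups * N);
    std::vector<int> idx(rows * groups * N);
    for (int r = 0; r < rows; ++r)
        for (int g = 0; g < groups; ++g)
            for (int n = 0; n < N; ++n) {
                int k = (r * groups + g) * N + n;
                vals[k] = float(r + 1);
                idx[k] = 2 * n;
            }
    std::vector<float> x(cols, 1.0f), y(rows, 0.0f);

    float *d_vals, *d_x, *d_y; int *d_idx;
    cudaMalloc(&d_vals, vals.size() * sizeof(float));
    cudaMalloc(&d_idx, idx.size() * sizeof(int));
    cudaMalloc(&d_x, cols * sizeof(float));
    cudaMalloc(&d_y, rows * sizeof(float));
    cudaMemcpy(d_vals, vals.data(), vals.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, idx.data(), idx.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x.data(), cols * sizeof(float), cudaMemcpyHostToDevice);

    nm_spmv<<<(rows + 127) / 128, 128>>>(d_vals, d_idx, d_x, d_y, rows, cols);
    cudaMemcpy(y.data(), d_y, rows * sizeof(float), cudaMemcpyDeviceToHost);
    // Each row has groups * N = 8 nonzeros of value r + 1 against an
    // all-ones x, so y[r] should be 8 * (r + 1).
    for (int r = 0; r < rows; ++r)
        printf("y[%d] = %.1f (expect %.1f)\n", r, y[r], 8.0f * (r + 1));
    cudaFree(d_vals); cudaFree(d_idx); cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```

Because the N:M constraint fixes the number of stored values per row, per-thread workloads are identical by construction; the paper's kernels build on this regularity (plus memory-access reorganization this sketch does not attempt) to reach the reported speedups.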