Welder: Scheduling Deep Learning Memory Access via Tile-graph

17th USENIX Symposium on Operating Systems Design and Implementation (OSDI'23) |

With the growing demand for processing higher fidelity data and the use of faster computing cores in newer hardware accelerators, modern deep neural networks (DNNs) are becoming increasingly memory intensive. A disparity between underutilized computing cores and saturated memory bandwidth has been observed in various popular DNN models. This inefficiency is caused by both the conventional treatment of DNNs as compute-intensive workloads and the lack of holistic memory access optimization in DNN models.

In this paper, we introduce Welder, a deep learning compiler that optimizes the execution efficiency from a holistic memory access perspective. The core of Welder is tile-graph, an abstraction that facilitates fine-grained data management at tile level. By leveraging the observation of optimization independence across memory layers, Welder is able to decompose the whole combinatorial DNN optimization space into several independent ones and effectively trade off between intra- and inter-operator data reuse using a tile traffic-based cost model. This allows Welder to unify previous ad-hoc memory optimizations into a single space, generate efficient execution plans with 89 more optimization patterns, and outperform state-of-the-art solutions significantly. Welder is also able to handle DNN models with arbitrarily large input by combining the existing accelerator memory and host memory as a whole system.