Indispensable CPU-centric Checkpointing for GPUs
- Junzhe Li
- Ran Shu
- Ziyue Yang
- Shuotao Xu
- Chenxiong Qian
- Yongqiang Xiong
APSys | Published by ACM
Checkpointing occupies a special place in modern deep learning training because it forces a hard tradeoff between training efficiency and reliability. Frequent checkpoints of model states improve resilience to random system failures, yet taking a checkpoint must stall training so that the model is not updated mid-snapshot, which inevitably hurts overall training efficiency. To mitigate this dilemma, the GPUDirect Storage (GDS) technique is gaining attention because it enables direct PCIe peer-to-peer accesses between SSDs and GPUs. Conceptually, GDS has the potential to fully exploit the high PCIe bandwidth and thereby accelerate the checkpointing process. Nevertheless, despite these promising features, GDS has not seen wide deployment in production.
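To make the GDS data path concrete, the sketch below writes a GPU-resident buffer straight to a file through NVIDIA's cuFile API, which is the interface GDS exposes. The file path, buffer size, and omitted error handling are illustrative assumptions, not details from the paper.

```cpp
// Minimal GDS checkpoint write via cuFile. Error handling abbreviated;
// path and size are hypothetical. Build against libcufile, e.g.:
//   nvcc gds_ckpt.cu -lcufile -o gds_ckpt
#define _GNU_SOURCE            // for O_DIRECT
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main() {
    const size_t size = 1ull << 30;        // 1 GiB of model state on the GPU
    void *dev_buf = nullptr;
    cudaMalloc(&dev_buf, size);

    cuFileDriverOpen();                    // initialize the GDS driver

    // GDS requires O_DIRECT so the transfer bypasses the page cache.
    int fd = open("/mnt/nvme/ckpt.bin", O_CREAT | O_WRONLY | O_DIRECT, 0644);

    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);

    // The write moves data GPU memory -> SSD over PCIe peer-to-peer,
    // without staging through a CPU bounce buffer.
    cuFileWrite(fh, dev_buf, size, /*file_offset=*/0, /*dev_offset=*/0);

    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    cudaFree(dev_buf);
    return 0;
}
```

The key property is the single `cuFileWrite` call: the checkpoint never touches host DRAM, so its throughput is bounded by the slower of the PCIe link and the SSD itself, which is exactly where the paper's bandwidth argument bites.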
In this paper, we examine the feasibility of GDS for checkpointing in real-world environments. Through careful analysis and experiments, we identify the fundamental constraint that hinders GDS deployment: the bandwidth of in-production storage devices lags far behind the bandwidth GDS requires to pay off for checkpointing. We argue that CPU-centric checkpointing remains indispensable for the foreseeable future and propose a single-copy optimization that further reduces memory overhead during checkpointing. Our results show that optimizing checkpointing performance necessitates advancements in both storage and software.
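For intuition, here is one plausible shape of a single-copy, CPU-centric checkpoint path: the device snapshot drains into a single pinned host buffer, and that same buffer is handed to the filesystem, so the state is materialized exactly once on the host. This is an illustrative reading under stated assumptions, not necessarily the paper's exact design; all names are hypothetical.

```cpp
// Hypothetical single-copy CPU-centric checkpoint: one pinned host
// buffer serves as both the D2H copy target and the write(2) source,
// avoiding a second serialization copy on the host.
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>

void checkpoint_single_copy(const void *dev_state, size_t size, const char *path) {
    void *host_buf = nullptr;
    cudaMallocHost(&host_buf, size);       // pinned, DMA-able host memory

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Async D2H copy lets training proceed on other streams while the
    // snapshot drains to the host.
    cudaMemcpyAsync(host_buf, dev_state, size, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    write(fd, host_buf, size);             // the pinned buffer is written as-is
    close(fd);

    cudaStreamDestroy(stream);
    cudaFreeHost(host_buf);
}
```

Compared with a conventional path that first serializes GPU state into an intermediate host object and then copies it again into an I/O buffer, this arrangement holds only one host-side replica of the model state at a time, which is the memory overhead the single-copy optimization targets.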