PZO: Pseudo-Zeroth-Order Algorithm for Training Deep Neural Networks

  • Pengyun Yue
  • Xuanlin Yang
  • Mingqing Xiao
  • Zhouchen Lin

NeurIPS 2025

Zeroth-order optimization (ZO) has received wide attention in machine learning, especially when computing the full gradient is expensive or even impossible. Recently, ZO has emerged as an important paradigm for memory-efficient fine-tuning of large language models (LLMs), circumventing the memory overhead of backpropagation. However, existing ZO gradient estimators exhibit variance that scales with the parameter dimension as $\mathcal{O}(d)$, leading to dimension-dependent convergence rates that are prohibitive at the scale of LLM parameters. To address this problem, we present a Pseudo-Zeroth-Order (PZO) framework for optimizing composite objective functions of the form $f(\theta) = \ell(h(\theta))$, which are typical of large-scale models, where $h$ represents complex, high-dimensional representations and $\ell$ is a task-specific loss. While existing zeroth-order methods estimate gradients using only the final loss value, our PZO algorithm estimates the Jacobian matrix of $h$ from the model output $h(\theta)$ and combines it with the gradient of the loss with respect to the model output, $\nabla_h \ell(h(\theta))$. Moreover, we apply an exponential moving average to the Jacobian estimators to reduce their variance. Experimental results demonstrate that PZO outperforms MeZO and MeZO-SVRG on classification, multiple-choice, and generation tasks in both full-parameter and PEFT fine-tuning settings by boosting convergence in the early stages of training. With a sliding-window technique, PZO introduces only a small, dimension-independent memory overhead, which enables efficient scaling of the model size.
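
Below is a minimal, illustrative sketch of the pseudo-zeroth-order idea described above: the Jacobian of $h$ is probed with random finite-difference perturbations, the probes are kept implicitly as a short sliding window of (seed, Jacobian-vector product) pairs, and each update combines them with the analytic gradient of the loss with respect to the low-dimensional model output. All names, the EMA-style weighting, and the window bookkeeping are assumptions made for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only; not the authors' released code.
from collections import deque

import numpy as np


def pzo_sketch(theta, h, loss_grad, steps=100, eps=1e-3, lr=1e-4, window=8, beta=0.9):
    """Hypothetical PZO-style loop for f(theta) = loss(h(theta)).

    theta     : flat parameter vector, shape [d]
    h         : callable mapping parameters to model outputs, shape [m] with m << d
    loss_grad : callable returning the gradient of the loss w.r.t. the model output, shape [m]
    """
    d = theta.shape[0]
    history = deque(maxlen=window)  # sliding window of (seed, JVP) pairs: O(window * m) memory

    for t in range(steps):
        seed = t  # store the random seed instead of the d-dimensional direction
        u = np.random.default_rng(seed).standard_normal(d)

        # Zeroth-order probe of the Jacobian of h along u:
        # (h(theta + eps*u) - h(theta - eps*u)) / (2*eps) ~= J(theta) @ u
        jvp = (h(theta + eps * u) - h(theta - eps * u)) / (2.0 * eps)
        history.append((seed, jvp))

        # Gradient of the task loss w.r.t. the (low-dimensional) model output
        g_out = loss_grad(h(theta))

        # Implicit J^T g_out over the window: sum_k w_k * u_k * <jvp_k, g_out>,
        # regenerating each direction u_k from its stored seed and down-weighting
        # older probes in an EMA-like fashion.
        grad_est = np.zeros(d)
        total_w = 0.0
        for age, (s, jv) in enumerate(reversed(history)):  # newest probe first
            w = beta ** age
            u_k = np.random.default_rng(s).standard_normal(d)
            grad_est += w * u_k * float(jv @ g_out)
            total_w += w
        grad_est /= total_w

        theta = theta - lr * grad_est

    return theta
```

Storing seeds rather than the full perturbation directions keeps the window's footprint independent of the parameter dimension $d$, mirroring the memory claim above; an explicit EMA of the dense Jacobian estimate would instead require an $m \times d$ buffer.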