Finetuning on task-specific datasets is a widely embraced paradigm for
harnessing the capabilities of pretrained LLMs on various downstream
tasks. Given the popularity of LLM finetuning and the privacy concerns that
accompany it, differentially private (DP) finetuning of pretrained LLMs has
garnered increasing attention as a way to safeguard the privacy of task-specific
datasets. At the core of designing DP LLM finetuning methods is achieving a
satisfactory tradeoff among privacy, utility, and scalability. Most existing
methods build upon the seminal work of DP-SGD. Despite pushing the scalability
of DP-SGD to its limit, DP-SGD-based finetuning methods are unfortunately
limited by the inherent inefficiency of SGD. In this paper, we investigate the
potential of DP zeroth-order methods for LLM finetuning, which avoid the
scalability bottleneck of SGD by approximating the gradient with a more
efficient zeroth-order estimate. Rather than treating the zeroth-order method
as a drop-in replacement for SGD, this paper presents a comprehensive study
both theoretically and empirically. First, we propose a stagewise DP
zeroth-order method that dynamically schedules key hyperparameters. This design
is grounded in the synergy between the DP random perturbation and the gradient
approximation error of the zeroth-order method, and its effect on the finetuning
trajectory. Second, we further enhance scalability by reducing the number of
trainable parameters, which are identified by repurposing a data-free pruning
technique that requires no additional data or extra privacy budget. We provide
theoretical analysis for both proposed methods. We conduct extensive empirical
analysis on both an encoder-only masked language model and a decoder-only
autoregressive language model, achieving impressive results in terms of
scalability and utility.
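
As a rough illustration of the idea described above, the sketch below shows one DP zeroth-order update in plain Python. It is a toy under stated assumptions, not the method proposed in the paper: the two-point finite-difference estimator, the per-example scalar clipping, and the names `dp_zo_step`, `per_example_loss`, `mu`, `clip`, and `noise_mult` are all illustrative choices. Privacy accounting and the stagewise hyperparameter schedule are omitted.

```python
import random


def dp_zo_step(theta, per_example_loss, batch, lr, mu, clip, noise_mult, rng):
    """One illustrative DP zeroth-order update on a flat list of parameters.

    All names and the clipping/noise scheme are assumptions for illustration.
    """
    # Shared random direction. It does not depend on the data.
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + mu * d for t, d in zip(theta, u)]
    minus = [t - mu * d for t, d in zip(theta, u)]

    total = 0.0
    for example in batch:
        # Two-point finite difference: one scalar per example, computed from
        # two loss evaluations and no backpropagation.
        diff = (per_example_loss(plus, example)
                - per_example_loss(minus, example)) / (2.0 * mu)
        # Clip the scalar so each example contributes at most `clip`.
        total += max(-clip, min(clip, diff))

    # Gaussian noise scaled to the clipping threshold.
    total += rng.gauss(0.0, noise_mult * clip)

    # Move along the shared direction by the noisy averaged scalar.
    scale = lr * total / len(batch)
    return [t - scale * d for t, d in zip(theta, u)]
```

In this sketch only a single scalar per example is clipped and perturbed, instead of a full per-example gradient vector as in DP-SGD. That is one way to read the scalability argument in the abstract.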

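This paper investigates the potential of differentially private zeroth-order methods for finetuning pretrained language models, which avoid the scalability bottleneck of SGD by approximating the gradient. It proposes a stagewise DP zeroth-order method that dynamically schedules hyperparameters and a data-free pruning technique that reduces the number of trainable parameters, and analyzes both methods theoretically and empirically.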

Differentially Private Zeroth-Order Methods for Scalable Large Language Model Finetuning

The proximal gradient method plays an important role in solving many
machine learning tasks, especially nonsmooth problems. However, in some
machine learning problems such as the bandit model and the black-box learning
problem, the proximal gradient method can fail because the explicit gradients of
these problems are difficult or infeasible to obtain. The gradient-free
(zeroth-order) method can address these problems because only the objective
function values are required in the optimization. Recently, the first
zeroth-order proximal stochastic algorithm was proposed to solve nonconvex
nonsmooth problems. However, its convergence rate is $O(\frac{1}{\sqrt{T}})$
for nonconvex problems, which is significantly slower than the best
convergence rate $O(\frac{1}{T})$ of the zeroth-order stochastic algorithm,
where $T$ is the number of iterations. To fill this gap, in this paper we propose a
class of faster zeroth-order proximal stochastic methods with the variance
reduction techniques of SVRG and SAGA, which are denoted as ZO-ProxSVRG and
ZO-ProxSAGA, respectively. In the theoretical analysis, we address the main
challenge that an unbiased estimate of the true gradient is not available in the
zeroth-order case, although it was required in previous theoretical analyses of both
SVRG and SAGA. Moreover, we prove that both ZO-ProxSVRG and ZO-ProxSAGA
algorithms have $O(\frac{1}{T})$ convergence rates. Finally, the experimental
results verify that our algorithms have a faster convergence rate than the
existing zeroth-order proximal stochastic algorithm.
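
To make the combination of function-value-only gradient estimates, SVRG-style variance reduction, and a proximal step more concrete, here is a minimal Python sketch. It rests on several assumptions that the abstract does not state: a coordinate-wise central-difference estimator, an L1 regularizer with soft-thresholding as its proximal operator, and the hypothetical names `zo_grad`, `prox_l1`, and `zo_prox_svrg`. It should be read as an illustration of the general idea, not as the authors' algorithm.

```python
import random


def zo_grad(f, x, mu=1e-4):
    """Coordinate-wise central-difference gradient estimate using only f values."""
    g = []
    for j in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[j] += mu
        xm[j] -= mu
        g.append((f(xp) - f(xm)) / (2.0 * mu))
    return g


def prox_l1(x, t):
    """Soft-thresholding, the proximal operator of t * ||x||_1."""
    return [max(abs(v) - t, 0.0) * (1.0 if v > 0 else -1.0) for v in x]


def zo_prox_svrg(fs, x0, lam, eta, epochs, inner, rng):
    """Sketch: minimize (1/n) * sum_i fs[i](x) + lam * ||x||_1 from function values."""
    n = len(fs)
    x = list(x0)
    for _ in range(epochs):
        # Snapshot point and its estimated per-component and full gradients.
        snap = list(x)
        snap_grads = [zo_grad(f, snap) for f in fs]
        full = [sum(g[j] for g in snap_grads) / n for j in range(len(x))]
        for _ in range(inner):
            i = rng.randrange(n)
            gi = zo_grad(fs[i], x)
            # SVRG-style variance-reduced direction built from estimates only.
            v = [a - b + c for a, b, c in zip(gi, snap_grads[i], full)]
            # Gradient step on the smooth part, then the proximal step.
            x = prox_l1([xj - eta * vj for xj, vj in zip(x, v)], eta * lam)
    return x
```

A call such as `zo_prox_svrg(fs, [0.0, 0.0], lam=0.5, eta=0.5, epochs=20, inner=10, rng=random.Random(0))` takes a list `fs` of component functions that each map a list of floats to a float. No gradient code is needed from the caller, which is the black-box setting the abstract describes.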

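This paper proposes two new zeroth-order proximal stochastic optimization algorithms, ZO-ProxSVRG and ZO-ProxSAGA, which use the variance reduction techniques of SVRG and SAGA. Both are proven to have an $O(\frac{1}{T})$ convergence rate, and experiments show that they converge faster than the existing zeroth-order proximal stochastic algorithm.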