We study the private empirical risk minimization (ERM) problem for losses
satisfying the $(\gamma,\kappa)$-Kurdyka-{\L}ojasiewicz (KL) condition. The
Polyak-{\L}ojasiewicz (PL) condition is a special case of this condition when
$\kappa=2$. Specifically, we study this problem under the constraint of
$\rho$-zero-concentrated differential privacy (zCDP). When $\kappa\in[1,2]$ and the
loss function is Lipschitz and smooth over a sufficiently large region, we
provide a new algorithm based on variance reduced gradient descent that
achieves the rate
$\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$ on the
excess empirical risk, where $n$ is the dataset size and $d$ is the dimension.
We further show that this rate is nearly optimal. When $\kappa \geq 2$ and the
loss is instead Lipschitz and weakly convex, we show it is possible to achieve
the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$
with a private implementation of the proximal point method. When the KL
parameters are unknown, we provide a novel modification and analysis of the
noisy gradient descent algorithm and show that this algorithm achieves a rate
of
$\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{\frac{2\kappa}{4-\kappa}}\big)$
adaptively, which is nearly optimal when $\kappa = 2$. We further show that,
without assuming the KL condition, the same gradient descent algorithm can
achieve fast convergence to a stationary point when the gradient stays
sufficiently large during the run of the algorithm. Specifically, we show that
this algorithm can approximate stationary points of Lipschitz, smooth (and
possibly nonconvex) objectives with rate as fast as
$\tilde{O}\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)$ and never worse than
$\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{1/2}\big)$. The latter
rate matches the best known rate for methods that do not rely on variance
reduction.
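
To make the privacy accounting behind noisy gradient descent concrete, the following is a minimal sketch and not the algorithm analyzed in this work. It assumes full-batch gradients, per-example clipping to the Lipschitz constant, replace-one neighboring datasets, and an even split of the $\rho$-zCDP budget across steps. The names `noisy_gradient_descent` and `per_example_grad` are hypothetical.

```python
import numpy as np

def noisy_gradient_descent(per_example_grad, data, w0, lipschitz, rho, steps, lr, seed=0):
    """Full-batch gradient descent with Gaussian noise calibrated for rho-zCDP.

    Assumption: per_example_grad(w, z) returns the gradient of the loss at w
    on example z. Gradients are clipped to norm `lipschitz`.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    # Replacing one example moves the average clipped gradient by at most 2L/n.
    sensitivity = 2.0 * lipschitz / n
    # The Gaussian mechanism with noise sigma is (sensitivity^2 / (2 sigma^2))-zCDP,
    # and zCDP composes additively, so each step is given budget rho / steps.
    sigma = sensitivity * np.sqrt(steps / (2.0 * rho))
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        grads = np.stack([per_example_grad(w, z) for z in data])
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        # Scale down any gradient whose norm exceeds the clipping threshold.
        grads = grads * np.minimum(1.0, lipschitz / np.maximum(norms, 1e-12))
        noise = rng.normal(0.0, sigma, size=w.shape)
        w = w - lr * (grads.mean(axis=0) + noise)
    return w, sigma
```

In this sketch the per-coordinate noise scale is $\sigma = \frac{2L}{n}\sqrt{\frac{T}{2\rho}}$ for $T$ steps, which is where the $\frac{\sqrt{d}}{n\sqrt{\rho}}$ dependence in the rates above originates.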

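We study differentially private empirical risk minimization for losses satisfying the $(\gamma,\kappa)$-Kurdyka-{\L}ojasiewicz (KL) condition. When the loss is Lipschitz and smooth, we propose a new algorithm based on variance-reduced gradient descent whose rate on the excess empirical risk is nearly optimal. When the KL parameters are unknown, we modify and analyze the noisy gradient descent algorithm and show that it adapts to them, with a rate that is nearly optimal when $\kappa = 2$. We also show that, without assuming the KL condition, the same gradient descent algorithm converges quickly to stationary points of Lipschitz, smooth (and possibly nonconvex) objectives.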
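
As a quick check of how the adaptive exponent $\frac{2\kappa}{4-\kappa}$ compares with the exponent $\kappa$ attained when the KL parameters are known, the arithmetic below may help. It is our own illustration derived from the two stated rates, restricted to $0 < \kappa < 4$, and $\varepsilon$ is shorthand introduced only for this example.

```latex
\[
\varepsilon := \frac{\sqrt{d}}{n\sqrt{\rho}}, \qquad
\frac{2\kappa}{4-\kappa} \le \kappa
\iff 2 \le 4-\kappa
\iff \kappa \le 2 .
\]
\[
\kappa = 2:\ \varepsilon^{2\kappa/(4-\kappa)} = \varepsilon^{2} = \varepsilon^{\kappa},
\qquad
\kappa = 1:\ \varepsilon^{2\kappa/(4-\kappa)} = \varepsilon^{2/3}
\ \text{versus}\ \varepsilon^{1}.
\]
```

For $\varepsilon < 1$ a smaller exponent means a slower rate, so the adaptive bound coincides with the known-parameter bound at $\kappa = 2$ and is looser for $\kappa < 2$.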

Differentially Private Non-Convex Optimization under the KL Condition with Optimal Rates

As models for natural language processing (NLP), computer vision (CV), and
recommendation systems (RS) demand ever more computation, large numbers of
GPUs/TPUs are run in parallel with a large batch (LB) to improve training
throughput. However, LB training often suffers from a large generalization gap
and lower final accuracy, which limits how far the batch size can grow. In this
work, we develop a variance-reduced gradient descent technique (VRGD) based
on the gradient signal-to-noise ratio (GSNR) and apply it to popular
optimizers such as SGD/Adam/LARS/LAMB. We carry out a theoretical analysis of
its convergence rate to explain its fast training dynamics, and a generalization
analysis to demonstrate its smaller generalization gap in LB training.
Comprehensive experiments demonstrate that VRGD can accelerate training ($1\sim
2 \times$), narrow the generalization gap, and improve final accuracy. We push
the batch size limit of BERT pretraining up to 128k/64k and of DLRM to 512k
without noticeable accuracy loss. At a batch size of 96k, we improve ImageNet
Top-1 accuracy by $0.52$ pp over LARS. The generalization gap of BERT and
ImageNet training is reduced by over $65\%$.
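
To illustrate the idea of weighting updates by GSNR, here is a minimal sketch. It assumes GSNR is the per-parameter ratio of the squared mean gradient to the gradient variance across devices. The normalization, the clipping range `[floor, 1]`, and the function name `gsnr_scaled_gradient` are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np

def gsnr_scaled_gradient(device_grads, floor=0.1, eps=1e-12):
    """Scale the mean gradient element-wise by a clipped, normalized GSNR.

    device_grads: array of shape (k, p), one gradient per device (mini-batch).
    """
    g = np.asarray(device_grads, dtype=float)
    mean = g.mean(axis=0)
    var = g.var(axis=0)
    # Per-parameter GSNR: squared mean over variance (eps avoids division by zero).
    gsnr = mean ** 2 / (var + eps)
    # Assumed scheme: normalize so the average GSNR is 1, then clip to [floor, 1],
    # so that noisy coordinates are damped but never zeroed out.
    scale = np.clip(gsnr / (gsnr.mean() + eps), floor, 1.0)
    return scale * mean, gsnr
```

The scaled gradient can then be handed to any base optimizer in place of the plain averaged gradient. Coordinates where the devices agree keep their full step, while coordinates dominated by noise take a smaller one.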

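We propose a variance-reduced gradient descent technique based on the gradient signal-to-noise ratio for large-batch training, give a theoretical analysis of its fast training dynamics together with a generalization analysis, and show that it accelerates training, narrows the generalization gap, and improves final accuracy.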