As the use of large embedding models in recommendation systems and language
applications increases, concerns over user data privacy have also risen.
DP-SGD, a training algorithm that combines differential privacy with stochastic
gradient descent, has been the workhorse in protecting user privacy without
compromising model accuracy by much. However, applying DP-SGD naively to
embedding models can destroy gradient sparsity, leading to reduced training
efficiency. To address this issue, we present two new algorithms, DP-FEST and
DP-AdaFEST, that preserve gradient sparsity during private training of large
embedding models. Our algorithms achieve substantial reductions ($10^6 \times$)
in gradient size, while maintaining comparable levels of accuracy, on benchmark
real-world datasets.
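
The sparsity problem described above can be illustrated with a minimal sketch. The abstract does not spell out how DP-FEST or DP-AdaFEST work, so the code below only shows the naive DP-SGD baseline: per-example clipping followed by Gaussian noise on every coordinate. The function names, the dict-based sparse gradient format, and all sizes are assumptions made for illustration.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a sparse gradient (dict: index -> value) to L2 norm <= max_norm."""
    norm = math.sqrt(sum(v * v for v in grad.values()))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return {i: v * scale for i, v in grad.items()}

def dp_sgd_gradient(per_example_grads, dim, max_norm, noise_multiplier, rng):
    """Naive DP-SGD: clip each example, sum, then add noise to EVERY coordinate."""
    total = [0.0] * dim
    for g in per_example_grads:
        for i, v in clip(g, max_norm).items():
            total[i] += v
    sigma = noise_multiplier * max_norm
    # The dense noise vector is what destroys sparsity.
    return [t + rng.gauss(0.0, sigma) for t in total]

rng = random.Random(0)
dim = 1000  # hypothetical number of embedding coordinates
# Each example touches at most 3 coordinates, as in an embedding lookup.
batch = [{rng.randrange(dim): rng.uniform(-1, 1) for _ in range(3)}
         for _ in range(8)]

touched = set().union(*(g.keys() for g in batch))
noisy = dp_sgd_gradient(batch, dim, max_norm=1.0, noise_multiplier=1.0, rng=rng)
nonzero_after = sum(1 for x in noisy if x != 0.0)

print(len(touched), "coordinates touched before noise")
print(nonzero_after, "nonzero coordinates after noise")
```

Before noise is added, only the handful of coordinates touched by the batch are nonzero; afterwards essentially all `dim` coordinates are, so the update can no longer be stored or applied sparsely.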

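To preserve gradient sparsity when privately training large embedding models with DP-SGD, we propose two new algorithms, DP-FEST and DP-AdaFEST, which achieve substantial reductions ($10^6 \times$) in gradient size while maintaining comparable accuracy.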

Sparsity-Preserving Differentially Private Training of Large Embedding Models

A recent empirical observation of activation sparsity in MLP layers offers an
opportunity to drastically reduce computation costs for free. Despite several
works attributing it to training dynamics, the theoretical explanation of
activation sparsity's emergence is restricted to shallow networks, small numbers
of training steps, and modified training, even though the sparsity has been
found in deep models trained by vanilla protocols for many steps. To fill these
three gaps, we propose the notion of gradient sparsity as the source of
activation sparsity, together with a theoretical explanation built on it:
gradient sparsity, and in turn activation sparsity, are necessary steps toward
adversarial robustness w.r.t. hidden features and parameters, which
approximately corresponds to the flatness of minima for well-learned models. The theory
applies to standardly trained LayerNorm-ed pure MLPs, and further to
Transformers or other architectures if noises are added to weights during
training. To eliminate other sources of flatness when arguing sparsities'
necessity, we discover the phenomenon of spectral concentration, i.e., the
ratio between the largest and the smallest non-zero singular values of weight
matrices is small. We utilize random matrix theory (RMT) as a powerful
theoretical tool to analyze stochastic gradient noises and discuss the
emergence of spectral concentration. With these insights, we propose two
plug-and-play modules for both training from scratch and sparsity finetuning,
as well as one radical modification that only applies to from-scratch training.
Another module, still under testing, that targets both sparsity and flatness
also follows directly from our theory. Validation experiments are conducted to
verify our explanation. Productivity experiments demonstrate that the
modifications improve sparsity, indicating further theoretical cost reductions
in both training and inference.
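
As a minimal sketch of the quantity under discussion, activation sparsity can be measured as the fraction of hidden units that are exactly zero after a ReLU. The layer sizes, the random weights, and the bias values below are hypothetical and not taken from the paper; the code shows only the measurement, not the paper's theory or its proposed modules.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def hidden_activations(weights, bias, x):
    """First-layer MLP activations: relu(W x + b)."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def activation_sparsity(activations):
    """Fraction of activations that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)

rng = random.Random(0)
d_in, d_hidden = 16, 256  # hypothetical layer sizes
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_hidden)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

# A more negative bias can only switch more units off, so sparsity rises.
for b in (0.0, -4.0):
    acts = hidden_activations(W, [b] * d_hidden, x)
    print(f"bias={b:+.1f}  sparsity={activation_sparsity(acts):.2f}")
```

Every zero activation is a row of the following weight matrix that need not be multiplied, which is where the computational savings mentioned in the abstract would come from.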

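Building on gradient sparsity and random matrix theory, this work explains the theoretical mechanism behind activation sparsity in deep models and its importance for adversarial robustness and performance, and proposes several modules and modifications for training and sparsity finetuning.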

Theoretical Explanation of Activation Sparsity through Flat Minima and Adversarial Robustness

Variance reduction methods such as SVRG and SpiderBoost use a mixture of
large and small batch gradients to reduce the variance of stochastic gradients.
Compared to SGD, these methods require at least double the number of operations
per update to model parameters. To reduce the computational cost of these
methods, we introduce a new sparsity operator: the random-top-k operator. Our
operator combines the top-k operator with the randomized coordinate descent
operator and reduces computational complexity by estimating the gradient
sparsity exhibited in a variety of applications. With this operator, large batch
gradients offer an extra benefit beyond variance reduction: a reliable estimate
of gradient sparsity. Theoretically, our algorithm is at least as good as the
best algorithm (SpiderBoost), and further excels in performance whenever the
random-top-k operator captures gradient sparsity. Empirically, our algorithm
consistently outperforms SpiderBoost using various models on various tasks
including image classification, natural language processing, and sparse matrix
factorization. We also provide empirical evidence to support the intuition
behind our algorithm via a simple gradient entropy computation, which serves to
quantify gradient sparsity at every iteration.
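
The abstract does not give the exact form of the operator or of the entropy measure, so the following is a sketch under stated assumptions. `random_top_k` keeps the `k_top` largest-magnitude coordinates (ranked by an optional `reference` vector, for example a large-batch gradient) and adds `k_rand` randomly chosen remaining coordinates, rescaled so that the result is unbiased given the top set. `gradient_entropy` is assumed to be the Shannon entropy of the normalized magnitudes. All names, parameters, and the rescaling rule are hypothetical.

```python
import math
import random

def random_top_k(grad, k_top, k_rand, rng, reference=None):
    """Keep k_top largest coordinates plus k_rand random ones from the rest.

    Coordinates are ranked by |reference| when given, else by |grad|.
    Random picks are scaled by (#remaining / k_rand), an assumed choice that
    makes the output unbiased for grad once the top set is fixed.
    """
    ref = grad if reference is None else reference
    d = len(grad)
    order = sorted(range(d), key=lambda i: abs(ref[i]), reverse=True)
    top, rest = order[:k_top], order[k_top:]
    out = [0.0] * d
    for i in top:
        out[i] = grad[i]
    if k_rand > 0 and rest:
        k_rand = min(k_rand, len(rest))
        scale = len(rest) / k_rand
        for i in rng.sample(rest, k_rand):
            out[i] = grad[i] * scale
    return out

def gradient_entropy(grad):
    """Shannon entropy of |g_i| / sum_j |g_j|; lower means sparser."""
    total = sum(abs(g) for g in grad)
    if total == 0.0:
        return 0.0
    probs = [abs(g) / total for g in grad if g != 0.0]
    return -sum(p * math.log(p) for p in probs)

rng = random.Random(0)
g = [5.0, -0.1, 0.2, -4.0, 0.05, 0.0]
print(random_top_k(g, k_top=2, k_rand=1, rng=rng))
print(gradient_entropy(g), gradient_entropy([1.0] * 6))
```

A gradient whose mass sits on a few coordinates has entropy well below `log(d)`, the value attained by a uniform gradient. In that regime the top-k part captures most of the signal, and the random part only corrects for the remainder.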

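This paper proposes a new sparsity operator, the random-top-k operator, which estimates gradient sparsity by combining the top-k operator with the randomized coordinate descent operator and thereby reduces the computational complexity of SVRG and SpiderBoost. Experiments show that the method outperforms SpiderBoost across a variety of models and tasks.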