How much information do NLP tasks really need from a transformer's attention
mechanism at application time (inference)? Recent work shows that transformers
are sparse and that the floating-point values in their computations can be
discretized to fewer values with minimal loss in task accuracy.
However, this requires retraining or even creating entirely new models, both of
which can be expensive and carbon-emitting. Focusing on optimizations that do
not require training, we systematically study the full range of typical
attention values necessary. This informs the design of an inference-time
quantization technique using both pruning and log-scaled mapping, which produces
only a few (e.g. $2^3$) unique values. On the tasks of question answering and
sentiment analysis, we find nearly 80% of attention values can be pruned to
zero with minimal ($< 1.0\%$) relative loss in accuracy. We use this pruning
technique in conjunction with quantizing the attention values to only a 3-bit
format, without retraining, resulting in only a 0.8% accuracy reduction on
question answering with fine-tuned RoBERTa.
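
The combination of pruning and log-scaled mapping described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `prune_and_log_quantize`, the pruning threshold of `1e-3`, and the choice to reserve one of the $2^3$ codes for exact zero are all assumptions made for the example.

```python
import math

def prune_and_log_quantize(row, threshold=1e-3, bits=3):
    """Prune small attention values to zero, then snap the rest to
    log-spaced levels.

    Assumptions (not taken from the paper): the threshold value, and the
    layout of levels evenly spaced in log space between `threshold` and 1.
    """
    n_levels = 2 ** bits - 1                    # one code is kept for exact zero
    log_lo, log_hi = math.log(threshold), 0.0   # softmax outputs lie in (0, 1]
    step = (log_hi - log_lo) / (n_levels - 1)
    levels = [math.exp(log_lo + i * step) for i in range(n_levels)]

    out = []
    for a in row:
        if a < threshold:
            out.append(0.0)                     # pruned
        else:
            # index of the nearest level, measured in log space
            idx = round((math.log(a) - log_lo) / step)
            idx = min(max(idx, 0), n_levels - 1)
            out.append(levels[idx])
    return out

# One row of attention probabilities (illustrative values only).
quantized = prune_and_log_quantize([0.6, 0.3, 0.0995, 0.0004, 0.0001])
```

With these assumed settings, every output is either zero or one of seven levels, so a row never holds more than $2^3$ distinct values. The quantized row is not renormalized here, so it will generally not sum to 1.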

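This work studies how much information NLP tasks really need from a transformer's attention mechanism at application time (inference). Focusing on optimizations that require no training, it systematically studies attention values and proposes an inference-time quantization technique based on pruning and log-scaled mapping. It finds that nearly 80% of attention values can be pruned to zero with less than 1.0% relative loss in accuracy. Combining this pruning with quantization of the attention values to only a 3-bit format, without retraining, causes only a 0.8% accuracy reduction with fine-tuned RoBERTa.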

On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers

Federated learning enables resource-constrained edge compute devices, such as
mobile phones and IoT devices, to learn a shared model for prediction, while
keeping the training data local. This decentralized approach to training models
provides privacy, security, regulatory and economic benefits. In this work, we
focus on the statistical challenge of federated learning when local data is
non-IID. We first show that the accuracy of federated learning drops
significantly, by up to 55% for neural networks trained on highly skewed
non-IID data, where each client device trains only on a single class of data.
We further show that this accuracy reduction can be explained by the weight
divergence, which can be quantified by the earth mover's distance (EMD) between
the distribution over classes on each device and the population distribution.
As a solution, we propose a strategy to improve training on non-IID data by
creating a small subset of data which is globally shared between all the edge
devices. Experiments show that accuracy can be increased by 30% for the
CIFAR-10 dataset with only 5% globally shared data.
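
To make the skew measure and the shared-subset idea concrete, here is a small sketch. It assumes the distance between a device's class distribution and the population distribution is the sum of absolute differences of class probabilities, with classes treated as unordered categories. The abstract does not spell out the exact formula, so this form, the helper names `class_distribution` and `class_distance`, and the toy sample counts are all assumptions.

```python
from collections import Counter

def class_distribution(labels, num_classes):
    """Empirical probability of each class in a list of integer labels."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in range(num_classes)]

def class_distance(p, q):
    """Sum of absolute differences between two class distributions
    (assumed form of the distance; see the lead-in)."""
    return sum(abs(a - b) for a, b in zip(p, q))

NUM_CLASSES = 10
population = [1 / NUM_CLASSES] * NUM_CLASSES   # assumed uniform population

# A highly skewed client: every local sample belongs to class 3.
client_labels = [3] * 90
# A small globally shared subset with one sample per class (toy sizes).
shared_labels = list(range(NUM_CLASSES))

before = class_distance(
    class_distribution(client_labels, NUM_CLASSES), population)
after = class_distance(
    class_distribution(client_labels + shared_labels, NUM_CLASSES), population)
```

In this toy setup the distance drops from about 1.8 to about 1.62 once the shared samples are mixed into the client's data. The code illustrates only the direction of the effect and does not reproduce the reported CIFAR-10 numbers.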

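This paper focuses on the statistical challenge that federated learning faces when local data is non-IID. It proposes improving training accuracy on non-IID data with a globally shared subset of data, and experiments show that sharing only 5% of the data globally can raise accuracy on the CIFAR-10 dataset by 30%.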