This paper follows up on a recent work of Neu et al. (2021) and presents some
new information-theoretic upper bounds for the generalization error of machine
learning models, such as neural networks, trained with SGD. We apply these
bounds to analyzing the generalization behaviour of linear and two-layer ReLU
networks. Experimental study of these bounds provides some insights into the SGD
training of neural networks. They also point to a new and simple regularization
scheme which we show performs comparably to the current state of the art.
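
For orientation, bounds in this family are usually stated in the form below. This is the well-known mutual-information bound of Xu and Raginsky (2017), given here only as background and under the assumption that the loss is $\sigma$-subgaussian; it is not one of the new bounds of this paper. The notation is chosen here for illustration: $S$ is the training sample of $n$ i.i.d. points, $W$ the learned weights, $L_\mu$ the population risk and $L_S$ the empirical risk.

```latex
\left| \mathbb{E}\big[ L_\mu(W) - L_S(W) \big] \right|
\;\le\;
\sqrt{\frac{2\sigma^2}{n}\, I(S; W)}
```

The less information the weights $W$ carry about the sample $S$, the smaller the expected generalization gap.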

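Building on the recent work of Neu et al. (2021), this paper presents new information-theoretic upper bounds on the generalization error of machine learning models. Applying these bounds, it analyzes the generalization behaviour of linear and ReLU networks and derives insights into SGD training as well as a new, simple regularization scheme. Experiments show that this regularization scheme performs comparably to the current state of the art.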

On the Generalization of Models Trained with SGD: Information-Theoretic Bounds and Implications

Using an extended and formalized version of the Q/C map analysis of Poole et
al. (2016), along with Neural Tangent Kernel theory, we identify the main
pathologies present in deep networks that prevent them from training fast and
generalizing to unseen data, and show how these can be avoided by carefully
controlling the "shape" of the network's initialization-time kernel function.
We then develop a method called Deep Kernel Shaping (DKS), which accomplishes
this using a combination of precise parameter initialization, activation
function transformations, and small architectural tweaks, all of which preserve
the model class. In our experiments we show that DKS enables SGD training of
residual networks without normalization layers on ImageNet and CIFAR-10
classification tasks at speeds comparable to standard ResNetV2 and Wide-ResNet
models, with only a small decrease in generalization performance. When
using K-FAC as the optimizer, we achieve similar results for networks without
skip connections. Our results apply to a wide variety of activation
functions, including those that traditionally perform very badly, such as the
logistic sigmoid. In addition to DKS, we contribute a detailed analysis of skip
connections, normalization layers, special activation functions like ReLU and
SELU, and various initialization schemes, explaining their effectiveness as
alternative (and ultimately incomplete) ways of "shaping" the network's
initialization-time kernel.
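
To make the Q/C map idea concrete, the following is a minimal numerical sketch of the standard mean-field maps for a single fully connected layer, in the spirit of Poole et al. (2016). It is not the extended formalization or the DKS procedure of this paper. The function names `q_map` and `c_map`, the parameters `sigma_w2` and `sigma_b2` (weight and bias variances), and the choice of 80 quadrature nodes are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite nodes and weights for expectations over z ~ N(0, 1).
_Z, _W = hermegauss(80)
_W = _W / _W.sum()  # normalize so the weights sum to 1


def q_map(phi, q, sigma_w2=1.0, sigma_b2=0.0):
    """Variance map: q' = sigma_w2 * E[phi(sqrt(q) z)^2] + sigma_b2."""
    return sigma_w2 * np.sum(_W * phi(np.sqrt(q) * _Z) ** 2) + sigma_b2


def c_map(phi, c, q=1.0, sigma_w2=1.0, sigma_b2=0.0):
    """Correlation map for two inputs with equal variance q and correlation c."""
    z1, z2 = np.meshgrid(_Z, _Z, indexing="ij")
    w = np.outer(_W, _W)  # product weights for the 2-D expectation
    u1 = np.sqrt(q) * z1
    u2 = np.sqrt(q) * (c * z1 + np.sqrt(max(0.0, 1.0 - c * c)) * z2)
    cov = sigma_w2 * np.sum(w * phi(u1) * phi(u2)) + sigma_b2
    # Normalize the output covariance by the output variance.
    return cov / q_map(phi, q, sigma_w2, sigma_b2)


def relu(x):
    return np.maximum(x, 0.0)


if __name__ == "__main__":
    # With sigma_w2 = 2, ReLU preserves the variance of its input.
    print(q_map(relu, 3.0, sigma_w2=2.0))
    # Uncorrelated inputs become positively correlated after one ReLU layer.
    print(c_map(relu, 0.0, sigma_w2=2.0))
```

Iterating `c_map` shows how the correlation between two inputs evolves with depth; a map that pushes every input pair quickly toward the same correlation is one way to picture a kernel whose "shape" has degenerated.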

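Using Neural Tangent Kernel theory and the Deep Kernel Shaping method, we control the "shape" of a deep network's initialization-time kernel function. This enables fast SGD training of residual networks without normalization layers and also improves results for some activation functions that traditionally perform very badly.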

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

We propose a novel technique for faster deep neural network training which
systematically applies sample-based approximation to the constituent tensor
operations, i.e., matrix multiplications and convolutions. We introduce new
sampling techniques, study their theoretical properties, and prove that they
provide the same convergence guarantees when applied to SGD training. We apply
approximate tensor operations to single and multi-node training of MLP and CNN
networks on MNIST, CIFAR-10 and ImageNet datasets. We demonstrate up to 66%
reduction in the amount of computation and communication, and up to 1.37x
faster training time while maintaining negligible or no impact on the final
test accuracy.
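
As an illustration of sample-based approximation of a matrix product, here is a minimal sketch of classic column-row sampling with norm-proportional probabilities. The abstract does not specify the paper's own sampling techniques, so this stands in for them rather than reproducing them. The function name `approx_matmul` and the sample count `k` are hypothetical.

```python
import numpy as np


def approx_matmul(a, b, k, rng):
    """Unbiased estimate of a @ b built from k sampled column-row pairs."""
    n = a.shape[1]
    # Sampling probabilities proportional to ||a[:, i]|| * ||b[i, :]||.
    scores = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=1)
    total = scores.sum()
    if total == 0.0:
        # Every column-row pair contributes nothing, so the product is zero.
        return np.zeros((a.shape[0], b.shape[1]))
    p = scores / total
    idx = rng.choice(n, size=k, replace=True, p=p)
    # Rescale each sampled pair by 1 / (k * p_i) to keep the estimate unbiased.
    scale = 1.0 / (k * p[idx])
    return (a[:, idx] * scale) @ b[idx, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((8, 32))
    b = rng.standard_normal((32, 8))
    exact = a @ b
    approx = approx_matmul(a, b, k=16, rng=rng)
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"relative Frobenius error with 16 of 32 pairs: {rel_err:.3f}")
```

The estimate multiplies an 8 x k matrix by a k x 8 matrix instead of using the full inner dimension, so the cost scales with `k`. A single estimate can be noisy, which is why convergence guarantees for SGD under such approximations need a separate argument.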

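By applying sample-based approximation to tensor operations (matrix multiplications and convolutions), this work proposes a new technique for accelerating deep neural network training. Experiments training MLP and CNN networks on the MNIST, CIFAR-10 and ImageNet datasets show that the method substantially reduces computation and communication and speeds up training with no perceptible impact on final test accuracy.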