Training recurrent neural networks (RNNs) remains a challenge due to the
instability of gradients across long time horizons, which can lead to exploding
and vanishing gradients. Recent research has linked these problems to the
values of the Lyapunov exponents of the forward dynamics, which describe the
growth or shrinkage of infinitesimal perturbations. Here, we propose gradient
flossing, a novel approach to tackling gradient instability by pushing Lyapunov
exponents of the forward dynamics toward zero during learning. We achieve this
by regularizing Lyapunov exponents through backpropagation using differentiable
linear algebra. This enables us to "floss" the gradients, stabilizing them and
thus improving network training. We demonstrate that gradient flossing controls
not only the gradient norm but also the condition number of the long-term
Jacobian, facilitating multidimensional error feedback propagation. We find
that applying gradient flossing prior to training enhances both the success
rate and convergence speed for tasks involving long time horizons. For
challenging tasks, we show that gradient flossing during training can further
increase the time horizon that can be bridged by backpropagation through time.
Moreover, we demonstrate the effectiveness of our approach on various RNN
architectures and tasks of variable temporal complexity. Additionally, we
provide a simple implementation of our gradient flossing algorithm that can be
used in practice. Our results indicate that gradient flossing via regularizing
Lyapunov exponents can significantly enhance the effectiveness of RNN training
and mitigate the exploding and vanishing gradient problem.

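Regularizing Lyapunov exponents stabilizes gradients and makes recurrent neural network (RNN) training more effective, mitigating the exploding and vanishing gradient problem.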
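
To make the quantity being regularized concrete, the following is a minimal sketch, not the authors' implementation. It estimates the leading Lyapunov exponents of a simplified input-free RNN, h_{t+1} = tanh(W h_t), by repeated QR decomposition of the propagated Jacobian, and evaluates a penalty that is zero only when those exponents are zero. The function names, the input-free update, and the sum-of-squares form of the penalty are assumptions made for illustration. In actual training, the penalty would be computed in an autodiff framework so that it can be backpropagated through the QR steps.

```python
import numpy as np

def lyapunov_exponents(W, h0, n_steps, k, seed=0):
    """Estimate the k largest Lyapunov exponents of h_{t+1} = tanh(W h_t).

    Hypothetical helper: the input-free tanh RNN is a simplification.
    """
    n = W.shape[0]
    h = h0.copy()
    rng = np.random.default_rng(seed)
    # Random orthonormal basis that tracks the k fastest-growing directions.
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    log_growth = np.zeros(k)
    for _ in range(n_steps):
        h = np.tanh(W @ h)
        # Jacobian of the update at the new state: diag(1 - h^2) W.
        J = (1.0 - h**2)[:, None] * W
        # Re-orthonormalize; |R_ii| is the local stretching along direction i.
        Q, R = np.linalg.qr(J @ Q)
        log_growth += np.log(np.abs(np.diag(R)))
    return log_growth / n_steps

def flossing_loss(exponents):
    """Assumed penalty: sum of squared exponents, minimized at zero."""
    return float(np.sum(exponents**2))
```

For W = diag(0.5, 2.0) and a zero initial state, the Jacobian equals W at every step, so the two estimates approach log 2 and -log 2 as the number of steps grows.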

Gradient Flossing: Improving Gradient Descent through Dynamic Control of Jacobians

Deep neural network training spends most of the computation on examples that
are already handled properly and could be ignored. We propose to mitigate this
phenomenon with a principled importance sampling scheme that focuses
computation on "informative" examples, and reduces the variance of the
stochastic gradients during training. Our contribution is twofold: first, we
derive a tractable upper bound to the per-sample gradient norm, and second, we
derive an estimator of the variance reduction achieved with importance
sampling, which enables us to switch it on when it will result in an actual
speedup. The resulting scheme can be used by changing a few lines of code in a
standard SGD procedure, and we demonstrate experimentally, on image
classification, CNN fine-tuning, and RNN training, that for a fixed wall-clock
time budget, it provides a reduction of the train losses of up to an order of
magnitude and a relative improvement of test errors between 5% and 17%.

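This study proposes an importance sampling scheme that reduces redundant computation during deep neural network training and lowers the training loss. Experiments show that, for the same time budget, the scheme reduces the training loss by up to an order of magnitude and yields a relative improvement in test error of 5% to 17%.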
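
The sampling step described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: each example is scored by the norm of the softmax cross-entropy gradient with respect to the logits (an assumed stand-in for the per-sample gradient-norm bound), examples are drawn with probability proportional to their score, and each draw is reweighted by 1 / (N p_i) so that the weighted mini-batch gradient stays an unbiased estimate of the mean gradient. All function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def importance_scores(logits, labels):
    """Per-sample score: norm of the cross-entropy gradient w.r.t. the logits."""
    grad = softmax(logits)
    grad[np.arange(len(labels)), labels] -= 1.0  # softmax minus one-hot target
    return np.linalg.norm(grad, axis=1)

def sample_batch(scores, batch_size, rng):
    """Draw indices with probability proportional to score.

    Returns the indices and the weights 1 / (N * p_i) that keep the
    weighted gradient estimate unbiased. Assumes scores.sum() > 0.
    """
    probs = scores / scores.sum()
    idx = rng.choice(len(scores), size=batch_size, replace=True, p=probs)
    weights = 1.0 / (len(scores) * probs[idx])
    return idx, weights
```

A confidently correct example receives a score near zero and is rarely drawn, while a misclassified one receives a large score. When all scores are equal, the weights reduce to 1 and the procedure matches uniform sampling.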