Numerous theories of learning suggest preventing the gradient variance from growing exponentially with depth or time, in order to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective whenever the architecture is too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on the data and parameter distributions, a theory that we refer to as the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, when analysing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks to fulfill the LSC often results in improved final performance across models. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.
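
The path-counting argument can be illustrated with a small numerical sketch. It is not the procedure of this study: it assumes a simplified model in which every grid cell (layer, time step) receives gradient from its depth neighbour and its time neighbour, each scaled by the same scalar weight, and the function name `gradient_path_mass` is hypothetical. Under these assumptions, a weight of one simply counts the paths through the grid, whereas a weight of one half keeps the total bounded.

```python
def gradient_path_mass(depth: int, time: int, weight: float) -> float:
    """Sum over all monotone paths through a (depth x time) grid.

    Simplifying assumption (not taken from the paper): each step along
    depth or along time multiplies the gradient by the scalar `weight`.
    """
    grid = [[0.0] * (time + 1) for _ in range(depth + 1)]
    grid[0][0] = 1.0  # gradient injected at the corner of the grid
    for layer in range(depth + 1):
        for step in range(time + 1):
            if layer == 0 and step == 0:
                continue
            # Contribution arriving along the depth axis.
            from_depth = grid[layer - 1][step] if layer > 0 else 0.0
            # Contribution arriving along the time axis.
            from_time = grid[layer][step - 1] if step > 0 else 0.0
            # Additive combination of both contributions.
            grid[layer][step] = weight * (from_depth + from_time)
    return grid[depth][time]


if __name__ == "__main__":
    for size in (2, 5, 10, 20):
        unit = gradient_path_mass(size, size, 1.0)
        half = gradient_path_mass(size, size, 0.5)
        print(f"grid {size}x{size}: weight 1 -> {unit:.3e}, weight 1/2 -> {half:.3e}")
```

In this toy model the weight-one total equals the binomial coefficient counting the grid paths (184756 for a 10 by 10 grid), while the weight-one-half total stays below one for the same grid.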

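In this study, we show that pre-training a network to local stability is effective for networks with complex architectures, and we propose a theory called the Local Stability Condition (LSC), which requires minimal assumptions on the data and parameter distributions. Our experimental results show that pre-training feed-forward and recurrent networks to satisfy the LSC often improves final performance. This study provides a means to stabilize networks of arbitrary complexity; the method can be applied as an additional step before pre-training on large augmented datasets, or as an alternative to finding stable initializations analytically.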

Stabilizing RNN Gradients through Pre-training