We present a novel, fast (exponential rate adaption), ab initio
(hyper-parameter-free), gradient-based optimizer algorithm. The main idea of the
method is to adapt the learning rate $\alpha$ by situational awareness, mainly
striving for orthogonal neighboring gradients. The method has a high success
rate and converges quickly, and it does not rely on hand-tuned parameters, which
gives it greater universality. It can be applied to problems of any dimension
$n$, and its cost scales only linearly, $O(n)$, with the dimension of the
problem. It optimizes convex and non-convex continuous landscapes that provide
some kind of gradient. In contrast to the Ada-family (AdaGrad, AdaMax, AdaDelta,
Adam, etc.),
the method is rotation invariant: optimization path and performance are
independent of coordinate choices. The impressive performance is demonstrated
by extensive experiments on the MNIST benchmark data-set against
state-of-the-art optimizers. We name this new class of optimizers after its
core idea: Exponential Learning Rate Adaption (ELRA). We present it in two
variants, c2min and p2min, with slightly different control. The authors strongly
believe that ELRA will open a completely new research direction for gradient
descent optimizers.
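
The idea of steering the learning rate toward orthogonal neighboring gradients can be illustrated with a minimal sketch. This is not the c2min or p2min variant, whose control rules the abstract does not spell out. The update `alpha *= exp(kappa * cos)`, the gain `kappa`, the starting `alpha`, the step-rejection safeguard based on function values, and the test function `f_quad` are all assumptions made for illustration. The sketch uses only inner products of gradients, so each step costs $O(n)$ and, in exact arithmetic, does not depend on the choice of coordinates.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0

def cosine_adaptive_descent(f, grad, x0, alpha=1e-3, kappa=0.5, steps=200):
    """Gradient descent whose learning rate is rescaled exponentially.

    Assumed rule (not taken from the paper): grow alpha when consecutive
    gradients point the same way (cos > 0), shrink it when they oppose
    each other (cos < 0), and leave it alone when they are orthogonal.
    """
    x, fx, g = list(x0), f(x0), grad(x0)
    for _ in range(steps):
        x_new = [xi - alpha * gi for xi, gi in zip(x, g)]
        f_new = f(x_new)
        if f_new > fx:
            # Assumed safeguard: reject an uphill step and halve alpha.
            alpha *= 0.5
            continue
        g_new = grad(x_new)
        alpha *= math.exp(kappa * cosine(g, g_new))
        x, fx, g = x_new, f_new, g_new
    return x, fx, alpha

# Hypothetical test problem: an ill-conditioned quadratic bowl.
def f_quad(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_quad(x):
    return [x[0], 10.0 * x[1]]

x_final, f_final, alpha_final = cosine_adaptive_descent(f_quad, grad_quad, [1.0, 1.0])
print(f_final, alpha_final)
```

Starting from a deliberately small `alpha`, the rate grows by a constant factor per step while successive gradients stay aligned, which is one way to read "exponential rate adaption". Unlike the method described above, this sketch still has tunable values (`kappa` and the initial `alpha`).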

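We propose a novel, fast, gradient-based optimization algorithm that adapts the learning rate through situational awareness, with orthogonal neighboring gradients as its main idea. The method converges quickly, does not rely on hand-tuned parameters, offers greater universality, and scales linearly on problems of arbitrary dimension n. Extensive experiments on the MNIST benchmark data-set against state-of-the-art optimizers demonstrate its strong performance. We name this new class of optimizers Exponential Learning Rate Adaption (ELRA); it will open a completely new research direction for gradient descent optimization.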

ELRA: Exponential learning rate adaption gradient descent optimization method

We develop an approach to efficiently grow neural networks, within which
parameterization and optimization strategies are designed by considering their
effects on the training dynamics. Unlike existing growing methods, which follow
simple replication heuristics or utilize auxiliary gradient-based local
optimization, we craft a parameterization scheme which dynamically stabilizes
weight, activation, and gradient scaling as the architecture evolves, and
maintains the inference functionality of the network. To address the
optimization difficulty resulting from imbalanced training effort distributed
to subnetworks fading in at different growth phases, we propose a learning rate
adaption mechanism that rebalances the gradient contribution of these separate
subcomponents. Experimental results show that our method achieves comparable or
better accuracy than training large fixed-size models, while saving a
substantial portion of the original computation budget for training. We
demonstrate that these gains translate into real wall-clock training speedups.
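
One simple way to grow a network while keeping its inference function unchanged is to give newly added hidden units zero outgoing weights. The sketch below shows only that generic idea on a two-layer ReLU network. It is not the parameterization scheme described above, which also stabilizes weight, activation, and gradient scaling, and it does not cover the learning rate adaption mechanism. The helper names `forward` and `grow_hidden` and the initialization scale are hypothetical.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(W1, W2, x):
    """Two-layer network without biases: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def grow_hidden(W1, W2, extra, rng):
    """Add `extra` hidden units without changing the network output.

    New incoming weights are random (assumed scale 1/sqrt(fan_in));
    new outgoing weights are zero, so the new units contribute nothing
    until training moves them away from zero.
    """
    n_in = len(W1[0])
    new_rows = [[rng.gauss(0.0, 1.0 / n_in ** 0.5) for _ in range(n_in)]
                for _ in range(extra)]
    W1_grown = [row[:] for row in W1] + new_rows
    W2_grown = [row[:] + [0.0] * extra for row in W2]
    return W1_grown, W2_grown

rng = random.Random(0)
W1 = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]  # 3 inputs, 4 hidden
W2 = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2)]  # 4 hidden, 2 outputs
x = [0.5, -1.0, 2.0]

before = forward(W1, W2, x)
W1_grown, W2_grown = grow_hidden(W1, W2, extra=3, rng=rng)
after = forward(W1_grown, W2_grown, x)
print(before, after)
```

The outputs before and after growth agree because every new hidden activation is multiplied by a zero outgoing weight. The new units still receive gradient through those outgoing weights, which is where a rebalancing of per-subnetwork learning rates, as proposed above, would come into play.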

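By considering the effects of parameterization and optimization strategies on training dynamics, we develop a method for efficiently growing neural networks. The method dynamically stabilizes weight, activation, and gradient scaling, proposes a learning rate adaption mechanism to address imbalanced training, and achieves accuracy comparable to or better than training large fixed-size models, along with faster training.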