We consider the development of practical stochastic quasi-Newton, and in
particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for
training deep neural networks (DNNs). In DNN training, the number of variables
and components of the gradient $n$ is often of the order of tens of millions
and the Hessian has $n^2$ elements. Consequently, computing and storing a full
$n \times n$ BFGS approximation or storing a modest number of (step, change in
gradient) vector pairs for use in an L-BFGS implementation is out of the
question. In our proposed methods, we approximate the Hessian by a
block-diagonal matrix and use the structure of the gradient and Hessian to
further approximate these blocks, each of which corresponds to a layer, as the
Kronecker product of two much smaller matrices. This is analogous to the
approach in KFAC, which computes a Kronecker-factored block-diagonal
approximation to the Fisher matrix in a stochastic natural gradient method.
Because of the indefinite and highly variable nature of the Hessian in a DNN,
we also propose a new damping approach that keeps the BFGS and L-BFGS
approximations bounded, both above and below. In tests on autoencoder
feed-forward neural network models with either nine or thirteen layers applied
to three datasets, our methods outperformed or performed comparably to KFAC and
state-of-the-art first-order stochastic methods.
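
One ingredient behind the Kronecker-factored blocks can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function name `kron_inverse_apply` and the random test matrices are hypothetical, and both factors are assumed to be symmetric positive definite. The sketch shows that when a layer's curvature block is approximated as $A \otimes B$, its inverse can be applied to a gradient using only the two small factors, without forming the large block.

```python
import numpy as np

def kron_inverse_apply(A, B, X):
    """Return Y such that vec(Y) = (A kron B)^{-1} vec(X), with row-major vec.

    Uses the identity (A kron B) vec(X) = vec(A X B^T), so the inverse is
    Y = A^{-1} X B^{-T}. Only the small factors A and B are ever solved with.
    """
    Z = np.linalg.solve(A, X)          # Z = A^{-1} X
    return np.linalg.solve(B, Z.T).T   # Y = Z B^{-T}

def random_spd(k, rng):
    """Hypothetical helper: a random symmetric positive definite k x k matrix."""
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

rng = np.random.default_rng(0)
m, n = 4, 3                            # stand-ins for a layer's two dimensions
A, B = random_spd(m, rng), random_spd(n, rng)
X = rng.standard_normal((m, n))        # stand-in for a layer's gradient

fast = kron_inverse_apply(A, B, X)
# Reference computation that forms the full (m*n) x (m*n) matrix explicitly.
dense = np.linalg.solve(np.kron(A, B), X.ravel()).reshape(m, n)
kron_matches_dense = np.allclose(fast, dense)
```

The dense solve is included only as a check. For a layer whose gradient has shape $m \times n$, the factored path works with an $m \times m$ and an $n \times n$ matrix rather than an $mn \times mn$ one.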

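This paper proposes Kronecker-factored block-diagonal BFGS and L-BFGS methods for training deep neural networks, which exploit the structure of the gradient and Hessian to approximate each Hessian block as a Kronecker product; in tests, the methods outperformed or performed comparably to KFAC and first-order stochastic methods.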

Practical Quasi-Newton Methods for Training Deep Neural Networks

Recurrent Neural Networks (RNNs) are powerful models that achieve exceptional
performance on several pattern recognition problems. However, the training of
RNNs is a computationally difficult task owing to the well-known
"vanishing/exploding" gradient problem. Algorithms proposed for training RNNs
either exploit no (or limited) curvature information and have cheap
per-iteration complexity, or attempt to gain significant curvature information
at the expense of a higher per-iteration cost. The former set includes
diagonally-scaled first-order methods such as ADAGRAD and ADAM, while the
latter consists of second-order algorithms like Hessian-Free Newton and K-FAC.
In this paper, we present adaQN, a stochastic quasi-Newton algorithm for
training RNNs. Our approach retains a low per-iteration cost while allowing for
non-diagonal scaling through a stochastic L-BFGS updating scheme. The method
uses a novel L-BFGS scaling initialization scheme and is judicious in storing
and retaining L-BFGS curvature pairs. We present numerical experiments on two
language modeling tasks and show that adaQN is competitive with popular RNN
training algorithms.
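
For readers unfamiliar with L-BFGS curvature pairs, the following is a minimal sketch of the standard two-loop recursion, which applies the inverse-Hessian approximation to a gradient using only stored (step, change in gradient) pairs. It is not adaQN itself: the function name `lbfgs_direction` is hypothetical, and the initial scaling used here is the common textbook choice $\gamma = s^\top y / y^\top y$, not the scaling initialization scheme the abstract refers to.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Apply the L-BFGS inverse-Hessian approximation to `grad`.

    s_list and y_list hold the stored pairs, oldest first. Each pair is
    assumed to satisfy y^T s > 0.
    """
    q = np.array(grad, dtype=float)
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)
    # Textbook initial scaling (an assumption here, not adaQN's scheme).
    if s_list:
        gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to newest.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r

# Check on a convex quadratic, where y = Q s for a positive definite Q.
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)
s_list = [rng.standard_normal(n) for _ in range(3)]
y_list = [Q @ s for s in s_list]

# The approximation must map the newest y back to the newest s (secant equation).
secant_ok = np.allclose(lbfgs_direction(y_list[-1], s_list, y_list), s_list[-1])
```

The memory cost is two vectors per stored pair, which is why a method that is selective about which pairs it keeps can hold the per-iteration cost low.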

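This paper proposes adaQN, a stochastic quasi-Newton algorithm for training recurrent neural networks (RNNs), a task made difficult by the vanishing/exploding gradient problem. The method uses a new L-BFGS scaling initialization scheme and is judicious in storing and retaining L-BFGS curvature pairs; experiments show that adaQN is competitive with popular RNN training algorithms.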

adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs

Quasi-Newton methods are widely used in practice for convex loss minimization
problems. These methods exhibit good empirical performance on a wide variety of
tasks and enjoy super-linear convergence to the optimal solution. For
large-scale learning problems, stochastic Quasi-Newton methods have recently
been proposed. However, these typically only achieve sub-linear convergence
rates and have not been shown to consistently perform well in practice since
noisy Hessian approximations can exacerbate the effect of high-variance
stochastic gradient estimates. In this work we propose Vite, a novel stochastic
Quasi-Newton algorithm that uses an existing first-order technique to reduce
this variance. Without exploiting the specific form of the approximate Hessian,
we show that Vite reaches the optimum at a geometric rate with a constant
step-size when dealing with smooth strongly convex functions. Empirically, we
demonstrate improvements over existing stochastic Quasi-Newton and variance
reduced stochastic gradient methods.
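
The idea of combining a variance-reduced gradient with a curvature-scaled step can be sketched as follows. The abstract does not name the first-order technique, so this sketch assumes an SVRG-style control variate; it is not the Vite algorithm. The function name `svrg_preconditioned`, the least-squares problem, and the fixed matrix `H` (a stand-in for an approximate inverse Hessian) are all hypothetical choices made for illustration.

```python
import numpy as np

def svrg_preconditioned(X, y, H, step=0.05, epochs=30, seed=0):
    """SVRG-style loop for least squares with a fixed preconditioner H.

    Minimizes (1/2n) * ||X w - y||^2. H is a stand-in for an approximate
    inverse Hessian; a real quasi-Newton method would update it over time.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = X.T @ (X @ w_snap - y) / n     # exact gradient at snapshot
        for _ in range(n):
            i = rng.integers(n)
            xi = X[i]
            g_w = xi * (xi @ w - y[i])             # sample gradient at w
            g_snap = xi * (xi @ w_snap - y[i])     # same sample at the snapshot
            v = g_w - g_snap + full_grad           # variance-reduced estimate
            w -= step * (H @ v)                    # constant step, scaled by H
    return w

# Noiseless synthetic problem, so the minimizer is exactly w_true.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = X @ w_true

# Hypothetical preconditioner: a slightly regularized inverse Hessian.
H = np.linalg.inv(X.T @ X / X.shape[0] + 0.1 * np.eye(5))
w_est = svrg_preconditioned(X, y, H)
error = np.linalg.norm(w_est - w_true)
```

The estimate `v` is unbiased for the full gradient, and its variance shrinks as both `w` and the snapshot approach the optimum, which is what allows a constant step size in this kind of scheme.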

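This work proposes Vite, a stochastic quasi-Newton optimization method that uses an existing first-order technique to reduce the variance of stochastic gradient estimates, and that empirically improves on existing stochastic quasi-Newton and variance-reduced stochastic gradient methods on large-scale learning problems.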