We propose NovoGrad, a first-order stochastic gradient method with layer-wise gradient normalization via second moment estimators and with decoupled weight decay for better regularization. The method requires half as much memory as Adam/AdamW. We evaluated NovoGrad on a diverse set of problems, including image classification, speech recognition, neural machine translation, and language modeling. On these problems, NovoGrad performed as well as or better than SGD and Adam/AdamW. Empirically, we show that NovoGrad (1) is very robust during the initial training phase and does not require learning rate warm-up, (2) works well with the same learning rate policy for different problems, and (3) generally performs better than other optimizers for very large batch sizes.

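Summary: the paper proposes NovoGrad, an adaptive stochastic gradient descent algorithm with layer-wise gradient normalization and decoupled weight decay. On image classification, speech recognition, machine translation, and language modeling, models trained with NovoGrad match or outperform those trained with standard SGD and Adam. The method is also robust, well suited to large-batch training, and more memory-efficient.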

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks
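
To make the abstract's description concrete, the following is a minimal NumPy sketch of one NovoGrad-style update, assuming the commonly stated form of the rule: each layer keeps a single scalar second-moment estimate of its squared gradient norm, the gradient is divided by the square root of that scalar, and weight decay is added to the normalized gradient before it enters the first moment. The function name `novograd_step`, the `state` dictionary layout, and all default hyperparameter values are illustrative assumptions, not taken from the text above.

```python
import numpy as np


def novograd_step(weights, grads, state, lr=0.01, beta1=0.95, beta2=0.98,
                  weight_decay=0.001, eps=1e-8):
    """Apply one NovoGrad-style update to a list of per-layer arrays.

    Illustrative sketch only: names and default values are assumptions.
    `state` is a dict that persists between calls (pass {} on the first call).
    """
    first_step = not state
    if first_step:
        state["m"] = [None] * len(weights)  # first moment: one array per layer
        state["v"] = [None] * len(weights)  # second moment: one scalar per layer

    new_weights = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        # Squared L2 norm of the whole layer's gradient (a single scalar).
        g_sq = float(np.sum(g * g))

        # Layer-wise second moment: initialized from the first gradient,
        # then tracked as an exponential moving average.
        if first_step:
            v = g_sq
        else:
            v = beta2 * state["v"][i] + (1.0 - beta2) * g_sq

        # Normalize the gradient per layer, then add decoupled weight decay.
        update = g / (np.sqrt(v) + eps) + weight_decay * w

        # First moment accumulates the normalized, decayed update.
        if first_step:
            m = update
        else:
            m = beta1 * state["m"][i] + update

        state["v"][i] = v
        state["m"][i] = m
        new_weights.append(w - lr * m)
    return new_weights


# Usage: a single "layer" with two parameters.
state = {}
w = [np.array([1.0, 2.0])]
g = [np.array([3.0, 4.0])]
w = novograd_step(w, g, state, lr=0.1, weight_decay=0.0)
```

In this sketch the only per-parameter optimizer state is the first moment `m`; the second moment `v` is one float per layer. Adam and AdamW store two per-parameter arrays, which is consistent with the abstract's claim of roughly half the memory.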