Jan 2013
Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients
Tom Schaul, Yann LeCun
TL;DR
To address the difficulty of tuning stochastic gradient descent (SGD), this paper proposes a method that automatically reduces the learning rate without any manual tuning, and further improves performance by handling parallelization, the update rule, non-smooth loss functions, and Hessian estimation within the iteration. The final algorithm has linear complexity and requires no hyperparameters.
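The summary above describes a per-parameter adaptive learning rate for SGD. Below is a minimal sketch of that idea, assuming the vSGD-style rate η_i = ḡ_i² / (h_i · v̄_i) from the authors' prior work on adaptive learning rates; the function name `adaptive_sgd_step`, the state layout, and the curvature handling are illustrative assumptions, not the exact algorithm of this paper.

```python
import numpy as np

def adaptive_sgd_step(theta, grad, curv, state, eps=1e-8):
    """One SGD step with a per-parameter adaptive learning rate,
    sketched in the spirit of Schaul & LeCun's vSGD framework.

    theta : parameter vector
    grad  : stochastic gradient at theta
    curv  : diagonal curvature (Hessian) estimate at theta
    state : dict of running averages "g", "v", "h" and memory "tau"
    """
    g, v, h, tau = state["g"], state["v"], state["h"], state["tau"]

    # Exponential moving averages with per-parameter memory 1/tau.
    g = g + (grad - g) / tau
    v = v + (grad ** 2 - v) / tau
    h = h + (np.abs(curv) - h) / tau

    # Adaptive per-parameter learning rate: eta_i = g_i^2 / (h_i * v_i).
    eta = g ** 2 / (h * v + eps)

    # Memory grows when the gradient is mostly noise and shrinks when
    # the averaged gradient dominates, keeping the rate adaptive.
    tau = (1.0 - g ** 2 / (v + eps)) * tau + 1.0

    state.update(g=g, v=v, h=h, tau=tau)
    return theta - eta * grad


# Illustrative usage on the toy quadratic loss 0.5 * ||theta||^2,
# whose gradient is theta and whose diagonal Hessian is 1.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.normal(size=5)
    state = {"g": np.zeros(5), "v": np.ones(5),
             "h": np.ones(5), "tau": np.full(5, 2.0)}
    for _ in range(200):
        grad = theta + 0.1 * rng.normal(size=5)   # noisy gradient
        theta = adaptive_sgd_step(theta, grad, np.ones(5), state)
    print(theta)  # should end up close to the optimum at 0
```

There are no step sizes or decay schedules to tune: the learning rate is derived entirely from the running gradient, squared-gradient, and curvature statistics, which is the hyperparameter-free property the TL;DR refers to.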
Abstract
Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all needs for tuning, while automatically reducing […]