With multiple iterations of updates, local stochastic gradient descent
(L-SGD) has been proven to be very effective in distributed machine learning
schemes such as federated learning. In fact, many innovative works have shown
that L-SGD with independent and identically distributed (IID) data can even
outperform SGD. As a result, extensive efforts have been made to unveil the
power of L-SGD. However, existing analyses fail to explain why the multiple
local updates with small mini-batches of data (L-SGD) cannot be replaced by
a single update with one big batch of data and a larger learning rate (SGD). In this
paper, we offer a new perspective to understand the strength of L-SGD. We
theoretically prove that, with IID data, L-SGD can effectively explore the
second-order information of the loss function. In particular, compared with
SGD, the updates of L-SGD have a much larger projection on the eigenvectors of
the Hessian matrix with small eigenvalues, which leads to faster convergence.
Under certain conditions, L-SGD can even approach the Newton method. Experimental
results on two popular datasets validate the theoretical results.
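
The claim about projections onto small-eigenvalue directions can be illustrated on a toy problem. The sketch below is not the paper's analysis: it assumes a noise-free two-dimensional quadratic loss with a hand-picked diagonal Hessian, and the function names (`local_steps_update`, `single_step_update`, `small_eigen_share`) are hypothetical. It compares the direction of `k` small gradient steps with the direction of one step whose learning rate is scaled by `k`.

```python
import numpy as np

def local_steps_update(H, w0, lr, k):
    """Total displacement after k small gradient steps on f(w) = 0.5 * w^T H w."""
    w = w0.copy()
    for _ in range(k):
        w = w - lr * (H @ w)  # the gradient of the quadratic is H w
    return w0 - w

def single_step_update(H, w0, lr, k):
    """Displacement of one gradient step whose learning rate is scaled by k."""
    return k * lr * (H @ w0)

def small_eigen_share(update):
    """Cosine between the update and the small-eigenvalue axis (index 1)."""
    return abs(update[1]) / np.linalg.norm(update)

# Assumed toy setup: eigenvalues 1.0 (large) and 0.01 (small), axis-aligned.
H = np.diag([1.0, 0.01])
w0 = np.array([1.0, 1.0])
lr, k = 0.5, 10

d_local = local_steps_update(H, w0, lr, k)
d_single = single_step_update(H, w0, lr, k)
newton_step = np.linalg.solve(H, H @ w0)  # H^{-1} * gradient, which equals w0 here

print("local steps :", small_eigen_share(d_local))
print("single step :", small_eigen_share(d_single))
print("Newton step :", small_eigen_share(newton_step))
```

In this toy, the update built from several local steps points more toward the small-eigenvalue axis than the single large step does, and taking many more local steps moves the displacement toward the Newton step. That limiting behaviour is a property of the noise-free quadratic used here and should not be read as the conditions stated in the paper.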

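Through theoretical analysis and experiments, the paper shows that local stochastic gradient descent (L-SGD) explores the second-order information of the loss function more effectively and therefore converges faster than stochastic gradient descent (SGD).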

Local stochastic gradient descent that accelerates convergence using second-order information of the loss function

Local SGD Accelerates Convergence by Exploiting Second Order Information of the Loss Function

This paper presents a state-of-the-art model for visual question answering
(VQA), which won first place in the 2017 VQA Challenge. VQA is a task of
significant importance for research in artificial intelligence, given its
multimodal nature, clear evaluation protocol, and potential real-world
applications. The performance of deep neural networks for VQA is highly dependent
on choices of architectures and hyperparameters. To help further research in
the area, we describe in detail our high-performing, though relatively simple,
model. Through a massive exploration of architectures and hyperparameters
representing more than 3,000 GPU-hours, we identified tips and tricks that lead
to its success, namely: sigmoid outputs, soft training targets, image features
from bottom-up attention, gated tanh activations, output embeddings initialized
using GloVe and Google Images, large mini-batches, and smart shuffling of
training data. We provide a detailed analysis of their impact on performance to
assist others in making an appropriate selection.
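
Two of the listed ingredients, gated tanh activations and sigmoid outputs trained against soft targets, can be sketched in a few lines. The abstract does not define them, so the formulation below is an assumption: a gated tanh layer is taken to be a tanh branch multiplied elementwise by a sigmoid gate, and the soft-target loss is taken to be binary cross-entropy with targets in [0, 1]. All names (`gated_tanh`, `soft_target_bce`, the weight arguments) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_tanh(x, W, b, W_gate, b_gate):
    """Assumed form: tanh(x W + b) multiplied elementwise by sigmoid(x W_gate + b_gate)."""
    return np.tanh(x @ W + b) * sigmoid(x @ W_gate + b_gate)

def soft_target_bce(logits, soft_targets, eps=1e-12):
    """Binary cross-entropy over independent sigmoid outputs.

    soft_targets holds per-answer scores in [0, 1] instead of a single
    one-hot label, so several candidate answers can be partially correct.
    """
    p = sigmoid(logits)
    per_answer = soft_targets * np.log(p + eps) + (1.0 - soft_targets) * np.log(1.0 - p + eps)
    return float(-np.mean(np.sum(per_answer, axis=1)))

# Illustrative shapes only: 4 examples, 8 input features, 5 hidden units.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W, W_gate = rng.normal(size=(8, 5)), rng.normal(size=(8, 5))
b, b_gate = np.zeros(5), np.zeros(5)
hidden = gated_tanh(x, W, b, W_gate, b_gate)

# One example, three candidate answers with soft scores.
targets = np.array([[1.0, 0.3, 0.0]])
loss_good = soft_target_bce(np.array([[4.0, -0.85, -4.0]]), targets)
loss_bad = soft_target_bce(np.array([[-4.0, 0.85, 4.0]]), targets)
```

With sigmoid outputs each candidate answer is scored independently, which is what lets a soft target such as 0.3 be fitted directly; a softmax output would instead force the scores to compete.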

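This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge. Through an in-depth exploration of architectures and hyperparameters representing more than 3,000 GPU-hours, we identified many tips and tricks that improve performance. We analyze their impact in detail to help others make an appropriate selection.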

Visual question answering tips: takeaways from the 2017 challenge

Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge

We analyze the learning properties of the stochastic gradient method when
multiple passes over the data and mini-batches are allowed. We study how
regularization properties are controlled by the step-size, the number of passes
and the mini-batch size. In particular, we consider the square loss and show
that for a universal step-size choice, the number of passes acts as a
regularization parameter, and optimal finite sample bounds can be achieved by
early-stopping. Moreover, we show that larger step-sizes are allowed when
considering mini-batches. Our analysis is based on a unifying approach,
encompassing both batch and stochastic gradient methods as special cases. As a
byproduct, we derive optimal convergence results for batch gradient methods
(even in the non-attainable cases).
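
The role of the number of passes as a regularization parameter can be sketched with a small experiment. This is an illustration under stated assumptions, not the paper's procedure: a linear model with the square loss, a fixed step-size, and a held-out validation set used to pick the stopping pass. The function `multipass_sgd` and the synthetic data are hypothetical.

```python
import numpy as np

def multipass_sgd(X, y, X_val, y_val, step, batch_size, n_passes, seed=0):
    """Mini-batch SGD on the square loss; returns the weights of the best pass."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    best_w, best_err, best_pass = w.copy(), np.inf, 0
    val_errors = []
    for p in range(1, n_passes + 1):
        order = rng.permutation(n)  # reshuffle the data at every pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)  # gradient of 0.5 * mean squared residual
            w = w - step * grad
        err = float(np.mean((X_val @ w - y_val) ** 2))
        val_errors.append(err)
        if err < best_err:  # early stopping: remember the best pass so far
            best_w, best_err, best_pass = w.copy(), err, p
    return best_w, best_pass, val_errors

# Assumed synthetic data: 5 features, unit true weights, small Gaussian noise.
rng = np.random.default_rng(0)
w_true = np.ones(5)
X = rng.normal(size=(200, 5))
y = X @ w_true + 0.1 * rng.normal(size=200)
X_val = rng.normal(size=(100, 5))
y_val = X_val @ w_true + 0.1 * rng.normal(size=100)

w_hat, best_pass, val_errors = multipass_sgd(
    X, y, X_val, y_val, step=0.05, batch_size=10, n_passes=20
)
print("best pass:", best_pass)
```

Here `step`, `batch_size`, and `n_passes` are the three quantities the abstract names; the sketch only tunes the number of passes and leaves the other two fixed.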

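This paper studies the learning properties of the stochastic gradient method under multiple passes over the data and mini-batching, and how the step-size, the number of passes, and the mini-batch size control its regularization properties. It shows that the number of passes acts as a regularization parameter, so optimal finite-sample bounds can be achieved by early stopping, and that mini-batches allow larger step-sizes. Using a unifying approach that covers both batch and stochastic gradient methods as special cases, it also derives optimal convergence results for batch gradient methods (even in the non-attainable cases).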

Optimal convergence rates for multi-pass stochastic gradient methods

Optimal Rates for Multi-pass Stochastic Gradient Methods

We address the issue of using mini-batches in stochastic optimization of
SVMs. We show that the same quantity, the spectral norm of the data, controls
the parallelization speedup obtained for both primal stochastic subgradient
descent (SGD) and stochastic dual coordinate ascent (SDCA) methods and use it
to derive novel variants of mini-batched SDCA. Our guarantees for both methods
are expressed in terms of the original nonsmooth primal problem based on the
hinge-loss.
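
A mini-batched primal stochastic subgradient method for the hinge-loss SVM can be sketched as follows. The abstract does not give the update rule, so this is a sketch under assumptions: an L2-regularized objective with parameter `lam`, and a Pegasos-style step-size of 1/(lam * t). The function name `minibatch_svm_sgd` and the toy data are hypothetical, and the sketch does not reproduce the paper's speedup analysis or its SDCA variants.

```python
import numpy as np

def minibatch_svm_sgd(X, y, lam, batch_size, n_iters, seed=0):
    """Mini-batch subgradient descent on lam/2 * ||w||^2 + mean hinge loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        idx = rng.choice(n, size=batch_size, replace=False)
        margins = y[idx] * (X[idx] @ w)
        violators = idx[margins < 1.0]  # examples with a nonzero hinge subgradient
        eta = 1.0 / (lam * t)           # assumed Pegasos-style step-size
        hinge_part = (y[violators][:, None] * X[violators]).sum(axis=0) / batch_size
        w = (1.0 - eta * lam) * w + eta * hinge_part
    return w

# Assumed toy data: two well-separated clusters labeled -1 and +1.
rng = np.random.default_rng(0)
n = 200
y = rng.choice([-1.0, 1.0], size=n)
X = 2.0 * y[:, None] + rng.uniform(-1.0, 1.0, size=(n, 2))

w = minibatch_svm_sgd(X, y, lam=0.1, batch_size=10, n_iters=200)
accuracy = float(np.mean(np.sign(X @ w) == y))
print("training accuracy:", accuracy)
```

Each iteration averages the subgradient over `batch_size` examples, which is the part of the computation that can be parallelized; how much speedup that yields is the question the paper ties to the spectral norm of the data.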

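This paper examines the use of mini-batches in stochastic optimization of SVMs and proposes new variants of mini-batched SDCA. Guarantees for both methods are given in terms of the original nonsmooth primal problem based on the hinge-loss.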