With multiple iterations of local updates, local stochastic gradient descent
(L-SGD) has proven very effective in distributed machine learning
schemes such as federated learning. In fact, many innovative works have shown
that L-SGD with independent and identically distributed (IID) data can even
outperform SGD. As a result, extensive efforts have been made to unveil the
power of L-SGD. However, existing analyses fail to explain why the multiple
local updates with small mini-batches of data (L-SGD) cannot be replaced by
a single update with one big batch of data and a larger learning rate (SGD). In this
paper, we offer a new perspective to understand the strength of L-SGD. We
theoretically prove that, with IID data, L-SGD can effectively explore the
second-order information of the loss function. In particular, compared with
SGD, the updates of L-SGD have a much larger projection on the eigenvectors of
the Hessian matrix with small eigenvalues, which leads to faster convergence.
Under certain conditions, L-SGD can even approach Newton's method. Experimental
results on two popular datasets validate the theoretical results.
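
The intuition behind the projection claim can be illustrated on a toy problem. The sketch below is not the paper's proof or experimental setup: it assumes a noise-free diagonal quadratic loss f(w) = 0.5 * sum(eigvals * w**2), so the Hessian eigenvectors are the coordinate axes, and it compares only update directions, not step sizes. The function names, eigenvalues, learning rate, and number of local steps are all hypothetical choices made for illustration.

```python
import numpy as np

def local_update(eigvals, w0, lr, num_local_steps):
    """Total displacement after several small gradient steps (L-SGD-like)."""
    w = w0.copy()
    for _ in range(num_local_steps):
        w = w - lr * eigvals * w  # gradient of the quadratic is eigvals * w
    return w0 - w

def single_big_step(eigvals, w0, lr, num_local_steps):
    """Displacement of one gradient step with learning rate lr * num_local_steps."""
    return lr * num_local_steps * eigvals * w0

def small_eig_share(update, eigvals, threshold):
    """Fraction of the squared update norm lying on eigenvectors with eigenvalue < threshold."""
    mask = eigvals < threshold
    return float(np.sum(update[mask] ** 2) / np.sum(update ** 2))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical Hessian spectrum: two large and two small eigenvalues.
eigvals = np.array([1.0, 0.5, 0.05, 0.01])
w0 = np.ones(4)
lr, num_local_steps = 0.5, 20

delta_local = local_update(eigvals, w0, lr, num_local_steps)
delta_big = single_big_step(eigvals, w0, lr, num_local_steps)

# For this quadratic, the Newton step H^{-1} * gradient equals w0 itself.
newton_step = w0

share_local = small_eig_share(delta_local, eigvals, threshold=0.1)
share_big = small_eig_share(delta_big, eigvals, threshold=0.1)
cos_local = cosine(delta_local, newton_step)
cos_big = cosine(delta_big, newton_step)

print("small-eigenvalue share  local:", share_local, " big step:", share_big)
print("cosine to Newton step   local:", cos_local, " big step:", cos_big)
```

In this toy setting the single large step scales each coordinate by its eigenvalue, whereas the accumulated local steps scale it by 1 - (1 - lr * eigenvalue)^K, which saturates along high-curvature directions. The accumulated update therefore places relatively more weight on the small-eigenvalue directions and points closer to the Newton direction.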

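Through theoretical analysis and experiments, this paper shows that local stochastic gradient descent (L-SGD) explores the second-order information of the loss function more effectively and therefore converges faster than stochastic gradient descent (SGD).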