Humans are able to accelerate their learning by selecting training materials
that are the most informative and at the appropriate level of difficulty. We
propose a framework for distributing deep learning in which one set of workers
searches for the most informative examples in parallel while a single worker
updates the model on examples selected by importance sampling. This leads the
model to update using an unbiased estimate of the gradient which also has
minimum variance when the sampling proposal is proportional to the L2-norm of
the gradient. We show experimentally that this method reduces gradient variance
even in a context where the cost of synchronization across machines cannot be
ignored, and where the factors for importance sampling are not updated
instantly across the training set.
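
The claim about unbiasedness and minimum variance can be checked on a toy problem. The sketch below is an illustration only, not the distributed system described here: it assumes a small least-squares model, and the helper names (`per_example_gradients`, `estimator_variance`, and so on) are hypothetical. It computes the exact expectation and the exact variance (trace of the covariance) of a one-sample gradient estimator under a uniform proposal and under a proposal proportional to the per-example gradient L2-norm.

```python
import math
import random


def per_example_gradients(w, xs, ys):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w, one per example."""
    grads = []
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        grads.append([err * xi for xi in x])
    return grads


def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))


def mean_gradient(grads):
    """The full-batch gradient that the estimator is meant to approximate."""
    n = len(grads)
    return [sum(g[j] for g in grads) / n for j in range(len(grads[0]))]


def sample_estimate(grads, probs, rng):
    """Draw one index i ~ probs and return the reweighted gradient g_i / (n * p_i)."""
    n = len(grads)
    i = rng.choices(range(n), weights=probs, k=1)[0]
    scale = 1.0 / (n * probs[i])
    return [scale * c for c in grads[i]]


def expected_estimate(grads, probs):
    """Exact expectation of the estimator: sum_i p_i * g_i / (n * p_i)."""
    n = len(grads)
    dim = len(grads[0])
    return [sum(p * g[j] / (n * p) for g, p in zip(grads, probs) if p > 0)
            for j in range(dim)]


def estimator_variance(grads, probs):
    """Exact trace of the covariance: E||g_hat||^2 - ||E[g_hat]||^2."""
    n = len(grads)
    second_moment = sum(l2_norm(g) ** 2 / (n * n * p)
                        for g, p in zip(grads, probs) if p > 0)
    return second_moment - l2_norm(mean_gradient(grads)) ** 2


# Toy data (an assumption for illustration): 50 examples, 3 features.
rng = random.Random(0)
xs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
ys = [rng.gauss(0.0, 1.0) for _ in range(50)]
w = [0.0, 0.0, 0.0]

grads = per_example_gradients(w, xs, ys)
norms = [l2_norm(g) for g in grads]

uniform = [1.0 / len(grads)] * len(grads)
proposal = [nrm / sum(norms) for nrm in norms]  # proportional to gradient norm

var_uniform = estimator_variance(grads, uniform)
var_importance = estimator_variance(grads, proposal)
one_draw = sample_estimate(grads, proposal, rng)

print("variance, uniform proposal:      ", var_uniform)
print("variance, gradient-norm proposal:", var_importance)
```

In this sketch the reweighting by `1 / (n * p_i)` is what keeps the estimator unbiased for any proposal that gives nonzero probability to every example with a nonzero gradient. The variance under the gradient-norm proposal is never larger than under the uniform one, which follows from the Cauchy-Schwarz inequality. The sketch recomputes all gradient norms exactly, whereas the setting described above has to work with norms that may be stale.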

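Summary: This work proposes a distributed deep learning framework in which a group of workers searches in parallel for the most informative examples, while a single worker updates the model using importance sampling. Experiments show that, when the sampling proposal is proportional to the L2-norm of the gradient, the method reduces gradient variance, even when the cost of synchronization across machines is not negligible and the importance sampling factors are not updated immediately.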