For large-scale non-convex stochastic optimization, parallel mini-batch SGD with multiple workers can ideally achieve a linear speed-up with respect to the number of workers compared with SGD on a single worker. In practice, however, this linear scalability is significantly limited b