With the rapid growth of data, distributed stochastic gradient descent~(DSGD)
has been widely used for solving large-scale machine learning problems. Due to
the latency and limited bandwidth of the network, communication has become the
bottleneck of DSGD when we need to train large-scale models.
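To make the communication cost concrete, here is a minimal sketch of synchronous DSGD on a least-squares problem; the worker count, model dimension, and data layout are illustrative assumptions, not details from this paper. Each worker transmits a full $d$-dimensional gradient every iteration, so per-step traffic grows linearly with model size, which is exactly the bottleneck described above.

\begin{verbatim}
import numpy as np

# Minimal synchronous DSGD sketch (illustrative parameters).
rng = np.random.default_rng(0)
num_workers, n_per_worker, d = 4, 256, 1000  # d = model size
lr = 0.01

# Each worker holds a local data shard (A_k, b_k).
shards = []
for _ in range(num_workers):
    A = rng.standard_normal((n_per_worker, d)) / np.sqrt(d)
    b = A @ rng.standard_normal(d)
    shards.append((A, b))

w = np.zeros(d)
for step in range(100):
    grads = []
    for A, b in shards:
        # Each worker computes a stochastic gradient on a mini-batch ...
        idx = rng.choice(n_per_worker, size=32, replace=False)
        g = A[idx].T @ (A[idx] @ w - b[idx]) / len(idx)
        grads.append(g)
    # ... and sends all d coordinates to the server each iteration:
    # per-step traffic is num_workers * d floats, which dominates
    # wall-clock time once the model is large.
    w -= lr * np.mean(grads, axis=0)
\end{verbatim}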