Over the last decades, stochastic gradient descent (SGD) has been intensively
studied by the Machine Learning community. Despite its versatility and
excellent performance, the optimization of large models via SGD still is a
time-consuming task. To reduce training time, it is common to