The step size used in stochastic gradient descent (SGD)
optimization is, in most training procedures, selected empirically. Moreover, the
use of learning-rate schedules such as Step Decay, Cyclical Learning Rates,
and Warmup to tune the step size requires extensive practical