Stateful optimizers maintain gradient statistics over time, e.g., the
exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past
gradient values. This state can be used to accelerate optimization compared to
plain stochastic gradient descent but uses memory that might