Several recent works have aimed to explain why severely overparameterized models generalize well when trained by stochastic gradient descent (SGD). The emergent consensus explanation has two parts: the first is that there are "no bad local minima", while the second is that SGD perform