Despite extensive study, the reason why overparameterized neural networks generalize remains elusive. Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, so a natural candidate explanation is that flatness implies generalization. This work critically examines that explanation. Through theoretical and empirical investigation, we identify the following three scenarios for two-layer ReLU networks: (1) flatness provably implies generalization; (2) there exist non-generalizing flattest models, and sharpness minimization algorithms fail to generalize; and (3) perhaps most surprisingly, there exist non-generalizing flattest models, yet sharpness minimization algorithms still generalize. Our results suggest that the relationship between sharpness and generalization depends subtly on the data distribution and the model architecture, and that sharpness minimization algorithms do not achieve better generalization solely by minimizing sharpness. This calls for a search for other explanations of the generalization of overparameterized neural networks.

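Despite extensive research, the fundamental reason why overparameterized neural networks can generalize remains unclear. Through theoretical and empirical investigation, this work shows that for two-layer ReLU networks: (1) flatness does imply generalization; (2) there exist flattest models that do not generalize, and sharpness minimization algorithms fail to generalize; (3) most surprisingly, there exist flattest models that do not generalize, yet sharpness minimization algorithms still generalize. Our results indicate that the relationship between sharpness and generalization depends subtly on the data distribution and the model architecture, and that sharpness minimization algorithms do not achieve better generalization solely by minimizing sharpness. This calls for other explanations of the generalization of overparameterized neural networks.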

Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization
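
The abstract does not name a specific sharpness minimization algorithm. As a concrete illustration only, the sketch below assumes Sharpness-Aware Minimization (SAM), one well-known member of this family, applied to a two-layer ReLU network with squared loss. The function names (`loss_and_grad`, `sam_step`), the absence of bias terms, the loss, and the values of `lr` and `rho` are all assumptions made for this example rather than the setup studied in the paper.

```python
import numpy as np

def loss_and_grad(params, X, y):
    """Squared loss and its gradient for f(x) = a^T relu(W x).

    params = (W, a) with W of shape (m, d) and a of shape (m,).
    The bias-free architecture is an assumption of this sketch.
    """
    W, a = params
    n = X.shape[0]
    Z = X @ W.T                      # pre-activations, shape (n, m)
    H = np.maximum(Z, 0.0)           # ReLU activations, shape (n, m)
    r = H @ a - y                    # residuals, shape (n,)
    loss = 0.5 * np.mean(r ** 2)
    grad_a = H.T @ r / n             # shape (m,)
    # d loss / d W[j, k] = (1/n) * sum_i r_i * a_j * 1[Z_ij > 0] * X_ik
    G = r[:, None] * (Z > 0) * a[None, :]   # shape (n, m)
    grad_W = G.T @ X / n             # shape (m, d)
    return loss, (grad_W, grad_a)

def sam_step(params, X, y, lr=0.1, rho=0.05):
    """One SAM update (assumed algorithm, not taken from the abstract).

    1. Compute the gradient at the current parameters.
    2. Move a distance rho along the normalized gradient (ascent direction).
    3. Compute the gradient at that perturbed point.
    4. Apply that gradient to the ORIGINAL parameters.
    """
    _, grads = loss_and_grad(params, X, y)
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads)) + 1e-12
    perturbed = tuple(p + rho * g / norm for p, g in zip(params, grads))
    _, sam_grads = loss_and_grad(perturbed, X, y)
    return tuple(p - lr * g for p, g in zip(params, sam_grads))
```

With `rho=0` the perturbation vanishes and `sam_step` reduces to an ordinary gradient descent step, which makes clear that the only difference from plain training is where the gradient is evaluated.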