For training neural networks, flat-minima optimizers that seek to find parameters in neighborhoods having uniformly low loss (flat minima) have been shown to improve upon stochastic and adaptive gradient-based methods. Two methods for finding flat minima stand out: 1. Averaging methods (i.e., Stochastic Weight Averaging, SWA), and 2. Minimax methods (i.e., Sharpness Aware Minimization, SAM). However, despite similar motivations, there has been limited investigation into their properties and no comprehensive comparison between them. In this work, we investigate the loss surfaces from a systematic benchmarking of these approaches across computer vision, natural language processing, and graph learning tasks. This leads us to a hypothesis: since both approaches find flat solutions in orthogonal ways, combining them should improve generalization even further. We verify this improves over either flat-minima approach in 39 out of 42 cases. When it does not, we provide potential explanations. We hope our results across image, graph, and text data will help researchers to improve deep learning optimizers, and practitioners to pinpoint the optimizer for the problem at hand.

通过比较基于平坦极小点优化器的损失曲面和在计算机视觉、自然语言处理和图表示学习任务的广泛基准测试中，我们发现了一些令人惊讶的发现，希望这能帮助研究人员进一步改进深度学习优化器，并帮助实践者为其问题选择正确的优化器。

平坦极小值优化器何时有效？