The expected loss is an upper bound on the model generalization error, which admits robust PAC-Bayes bounds for learning. However, loss minimization is known to ignore misspecification, where models cannot exactly reproduce observations. This leads to significant underestimates of parameter uncertainties in the large-data, or underparametrized, limit. We analyze the generalization error of near-deterministic, misspecified and underparametrized surrogate models, a regime of broad relevance in science and engineering. We show that posterior distributions must cover every training point to avoid a divergent generalization error, and derive an ensemble ansatz that respects this constraint, which for linear models incurs minimal overhead. The efficient approach is demonstrated on model problems before application to high-dimensional datasets in atomistic machine learning. Parameter uncertainties from misspecification survive in the underparametrized limit, giving accurate prediction and bounding of test errors.
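The underestimate of parameter uncertainty under misspecification can be illustrated with a small numerical sketch. This is not the ensemble ansatz of the paper; it is a toy example under assumed choices: a noise-free quadratic target, a straight-line model, and the textbook least-squares parameter covariance. The function name `parameter_band_coverage` and all settings are hypothetical.

```python
import numpy as np

def parameter_band_coverage(n_points, seed=0):
    """Fit a straight line to noise-free y = x**2 and measure how well the
    parameter-induced predictive band covers the training points.
    Toy setup chosen for illustration only."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n_points)
    y = x**2                                    # deterministic target, no noise
    X = np.column_stack([np.ones_like(x), x])   # misspecified linear model

    # Loss minimization: ordinary least squares.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ theta

    # Textbook parameter covariance; it shrinks like 1/N.
    sigma2 = resid @ resid / (n_points - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)

    # Predictive standard deviation due to parameter uncertainty alone.
    pred_std = np.sqrt(np.einsum("ij,jk,ik->i", X, cov, X))

    # Fraction of training points inside the +/- 2 std band.
    covered = np.abs(resid) <= 2.0 * pred_std
    rmse = np.sqrt(np.mean(resid**2))
    return pred_std.mean(), rmse, covered.mean()

if __name__ == "__main__":
    for n in (20, 200, 2000):
        std, rmse, frac = parameter_band_coverage(n)
        print(f"N={n:5d}  mean pred. std={std:.4f}  RMSE={rmse:.4f}  covered={frac:.2f}")
```

In this toy setting the fit error stays roughly constant as the number of points grows, while the parameter-induced band narrows, so ever fewer training points fall inside it. This is the regime in which the abstract argues that the posterior must instead cover every training point.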

Misspecification uncertainties in near-deterministic regression