This study examines the generalization ability of algorithm performance
prediction models across various benchmark suites. Comparing the statistical
similarity between the problem collections with the accuracy of performance
prediction models that are based on exploratory landscape analysis features, we
observe that there is a positive correlation between these two measures.
Specifically, when the differences between the high-dimensional feature value
distributions of the training and testing suites are not statistically
significant, the model tends to generalize well, in the sense that the testing
errors are in the same range as the training errors. Two experiments validate
these findings: one involving the
standard benchmark suites, the BBOB and CEC collections, and another using five
collections of affine combinations of BBOB problem instances.
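
The comparison of feature value distributions between two suites can be illustrated with a two-sample permutation test. This is a minimal sketch, not the test used in the study, which the abstract does not name: the energy distance statistic, the function names, and the synthetic "suites" (random vectors standing in for landscape feature vectors) are all assumptions made for illustration.

```python
import numpy as np

def energy_distance(x, y):
    """Energy distance between two samples of feature vectors (rows = instances)."""
    def mean_pairwise(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()
    return 2.0 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)

def permutation_test(x, y, n_permutations=500, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = energy_distance(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_permutations):
        # Randomly reassign instances to the two suites.
        perm = rng.permutation(len(pooled))
        stat = energy_distance(pooled[perm[:n]], pooled[perm[n:]])
        if stat >= observed:
            count += 1
    return observed, (count + 1) / (n_permutations + 1)

# Synthetic stand-ins for feature matrices (hypothetical data, not BBOB or CEC).
rng = np.random.default_rng(42)
suite_a = rng.normal(0.0, 1.0, size=(60, 5))
suite_b = rng.normal(0.0, 1.0, size=(60, 5))   # same distribution as suite_a
suite_c = rng.normal(3.0, 1.0, size=(60, 5))   # clearly shifted distribution

_, p_similar = permutation_test(suite_a, suite_b)
_, p_shifted = permutation_test(suite_a, suite_c)
print(f"same distribution:    p = {p_similar:.3f}")
print(f"shifted distribution: p = {p_shifted:.3f}")
```

A small p-value suggests that the two suites differ in feature space. Following the abstract, that is the situation in which the testing error would be expected to drift away from the training error.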

Generalization Ability of Feature-based Performance Prediction Models: A Statistical Analysis across Benchmarks

A key component of automated algorithm selection and configuration, which in
most cases are performed using supervised machine learning (ML) methods, is a
well-performing predictive model. The predictive model uses the feature
representation of a set of problem instances as input data and predicts the
algorithm performance achieved on them. Common machine learning models struggle
to make predictions for instances with feature representations not covered by
the training data, resulting in poor generalization to unseen problems. In this
study, we propose a workflow to estimate the generalizability of a predictive
model for algorithm performance that is trained on one benchmark suite and
applied to another. The
workflow has been tested by training predictive models across benchmark suites
and the results show that generalizability patterns in the landscape feature
space are reflected in the performance space.
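
The cross-suite check described above can be sketched as follows: fit a model on one suite, then compare its training error with its error on a second suite. The abstract does not detail the workflow, so everything here is an assumption for illustration: the linear least-squares model, the helper names, and the synthetic feature and performance data.

```python
import numpy as np

def fit_linear(features, performance):
    """Least-squares linear model with an intercept term."""
    design = np.hstack([features, np.ones((len(features), 1))])
    weights, *_ = np.linalg.lstsq(design, performance, rcond=None)
    return weights

def mean_abs_error(weights, features, performance):
    design = np.hstack([features, np.ones((len(features), 1))])
    return float(np.abs(design @ weights - performance).mean())

rng = np.random.default_rng(0)

def make_suite(center, n=200):
    """Hypothetical suite: 3 features per instance and a nonlinear 'performance'."""
    x = rng.normal(center, 1.0, size=(n, 3))
    y = np.sin(x).sum(axis=1) + rng.normal(0.0, 0.05, size=n)
    return x, y

train_x, train_y = make_suite(center=0.0)
covered_x, covered_y = make_suite(center=0.0)   # same feature region as training
shifted_x, shifted_y = make_suite(center=4.0)   # feature region unseen in training

model = fit_linear(train_x, train_y)
err_train = mean_abs_error(model, train_x, train_y)
err_covered = mean_abs_error(model, covered_x, covered_y)
err_shifted = mean_abs_error(model, shifted_x, shifted_y)
print(f"train: {err_train:.3f}  covered: {err_covered:.3f}  shifted: {err_shifted:.3f}")
```

In this toy setting the error on the covered suite stays close to the training error, while the error on the shifted suite is much larger. This mirrors the claim that patterns in feature space carry over to performance space.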

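This study proposes a method for estimating the generalization ability of algorithm performance prediction models and tests its feasibility by training predictive models across benchmark suites. The results show that generalization patterns in the feature space are indeed reflected in the performance space.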

Assessing the Generalizability of a Performance Predictive Model

Progress in machine learning is measured by careful evaluation on problems of
outstanding common interest. However, the proliferation of benchmark suites and
environments, adversarial attacks, and other complications has diluted the
basic evaluation model by overwhelming researchers with choices. Deliberate or
accidental cherry picking is increasingly likely, and designing well-balanced
evaluation suites requires increasing effort. In this paper we take a step back
and propose Nash averaging. The approach builds on a detailed analysis of the
algebraic structure of evaluation in two basic scenarios: agent-vs-agent and
agent-vs-task. The key strength of Nash averaging is that it automatically
adapts to redundancies in evaluation data, so that results are not biased by
the incorporation of easy tasks or weak agents. Nash averaging thus encourages
maximally inclusive evaluation, since there is no harm (computational cost
aside) in including all available tasks and agents.
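
The redundancy property can be illustrated on a toy agent-vs-agent example. This is a minimal sketch under stated assumptions, not the paper's procedure: the abstract does not say how the equilibrium is computed or selected, so the code approximates a Nash equilibrium of an antisymmetric payoff matrix with self-play multiplicative weights. The rock-paper-scissors payoffs and the function name are hypothetical.

```python
import numpy as np

def approximate_nash(payoff, steps=50000, eta=0.01):
    """Time-averaged multiplicative-weights self-play on an antisymmetric game."""
    n = payoff.shape[0]
    logits = np.zeros(n)          # uniform start treats identical agents identically
    average = np.zeros(n)
    for _ in range(steps):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        average += p
        logits += eta * (payoff @ p)   # reward agents that beat the current mixture
    return average / steps

# Agents: rock, an exact copy of rock, paper, scissors (hypothetical example).
# payoff[i, j] is the payoff of agent i against agent j.
payoff = np.array([
    [ 0.0,  0.0, -1.0,  1.0],
    [ 0.0,  0.0, -1.0,  1.0],
    [ 1.0,  1.0,  0.0, -1.0],
    [-1.0, -1.0,  1.0,  0.0],
])

uniform_scores = payoff.mean(axis=1)   # plain averaging over all opponents
nash = approximate_nash(payoff)
nash_scores = payoff @ nash            # scores against the equilibrium mixture

print("uniform average:", uniform_scores)   # paper looks strong, scissors weak
print("equilibrium:    ", np.round(nash, 3))
print("Nash scores:    ", np.round(nash_scores, 3))
```

Plain averaging rewards paper simply because rock appears twice. Under the approximate equilibrium, the two rock copies share roughly one third of the weight between them, and all agents score close to zero, so the duplicate does not bias the result.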

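This paper introduces an evaluation method called Nash averaging, which automatically adapts to redundancies in the evaluation data, thereby avoiding the bias introduced by including easy tasks or weak agents and enabling maximally inclusive evaluation.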