Graph classification benchmarks, vital for assessing and developing graph neural networks (GNNs), have recently come under scrutiny, as simple methods such as MLPs have demonstrated comparable performance. This raises an important question: do these benchmarks effectively distinguish the advancements of GNNs over other methodologies? If so, how can this effectiveness be measured quantitatively? In response, we first propose an empirical protocol based on a fair benchmarking framework to investigate the performance discrepancy between simple methods and GNNs. We further propose a novel metric that quantifies dataset effectiveness by considering both dataset complexity and model performance. To the best of our knowledge, our work is the first to thoroughly study and provide an explicit definition of dataset effectiveness in the graph learning area. Testing across 16 real-world datasets, we find that our metric aligns with existing studies and intuitive assumptions. Finally, we explore the causes behind the low effectiveness of certain datasets by investigating the correlation between intrinsic graph properties and class labels, and we develop a novel technique that supports correlation-controllable synthetic dataset generation. Our findings shed light on the current understanding of benchmark datasets, and our new platform could fuel the future evolution of graph classification benchmarks.

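In the graph learning area, we are the first to thoroughly study and explicitly define dataset effectiveness. Using an empirical protocol based on a fair benchmarking framework and a new metric that accounts for both dataset complexity and model performance, we find that our metric aligns with existing studies and intuitive assumptions. By studying the correlation between intrinsic graph properties and class labels, and by developing a new technique that supports correlation-controllable synthetic dataset generation, we reveal the causes behind the low effectiveness of certain datasets. Our findings shed light on the current understanding of benchmark datasets and could drive the future evolution of graph classification benchmarks.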

Rethinking the Effectiveness of Benchmark Graph Classification Datasets for Evaluating GNNs