Graph Neural Networks (GNNs) are indispensable for learning from graph-structured data, yet their rising computational costs, especially on massively connected graphs, pose significant challenges for execution performance. To tackle this, distributed-memory solutions, such as partitioning the graph to concurrently train multiple replicas of a GNN, are used in practice. However, approaches that require a partitioned graph usually suffer from communication overhead and load imbalance, even under optimal partitioning and communication strategies, due to irregularities in neighborhood minibatch sampling. This paper proposes practical trade-offs for reducing the sampling and communication overheads of representation learning on distributed graphs (using the popular GraphSAGE architecture) by developing a parameterized continuous prefetch and eviction scheme on top of the state-of-the-art Amazon DistDGL distributed GNN framework. The scheme demonstrates about a 15-40% improvement in end-to-end training performance on the National Energy Research Scientific Computing Center's (NERSC) Perlmutter supercomputer for various OGB datasets.

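This work addresses the computational cost and performance challenges of training graph neural networks (GNNs) on massively connected graphs. By developing a parameterized continuous prefetch and eviction scheme on top of the state-of-the-art Amazon DistDGL distributed GNN framework, the paper proposes practical trade-offs for reducing sampling and communication overheads, achieving a 15-40% improvement in training performance on the Perlmutter supercomputer at the National Energy Research Scientific Computing Center.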

MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs