Many real-world graphs contain time-domain information. Temporal Graph Neural
Networks capture temporal information as well as structural and contextual
information in the generated dynamic node embeddings. Researchers have shown
that these embeddings achieve state-of-the-art performance in many different
tasks. In this work, we propose TGL, a unified framework for large-scale
offline Temporal Graph Neural Network training where users can compose various
Temporal Graph Neural Networks with simple configuration files. TGL comprises
five main components: a temporal sampler, a mailbox, a node memory module, a
memory updater, and a message passing engine. We design a Temporal-CSR data
structure and a parallel sampler to efficiently sample temporal neighbors to
form training mini-batches. We propose a novel random chunk scheduling technique
that mitigates the problem of obsolete node memory when training with a large
batch size. Because current Temporal Graph Neural Networks (TGNNs) have only been
evaluated on small-scale datasets, we introduce two large-scale real-world datasets with 0.2
and 1.3 billion temporal edges. We evaluate the performance of TGL on four
small-scale datasets with a single GPU and the two large datasets with multiple
GPUs for both link prediction and node classification tasks. We compare TGL
with the open-source implementations of five methods and show that TGL achieves
similar or better accuracy with an average speedup of 13x. Our temporal parallel
sampler achieves an average speedup of 173x on a multi-core CPU compared with
the baselines. On a 4-GPU machine, TGL can train one epoch of more than one
billion temporal edges within 1-10 hours. To the best of our knowledge, this is
the first work that proposes a general framework for large-scale Temporal Graph
Neural Network training on multiple GPUs.
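
The Temporal-CSR structure and temporal neighbor sampling mentioned above can
be illustrated with a minimal sketch. It is not TGL's implementation: the class
name `TemporalCSR`, its methods, and the use of a binary search over
time-sorted neighbor lists are assumptions made for illustration, and the
sketch is single-threaded and stores only outgoing edges.

```python
from bisect import bisect_left


class TemporalCSR:
    """CSR-style adjacency whose per-node neighbor lists are sorted by time.

    Hypothetical illustration; names and layout are not taken from TGL.
    """

    def __init__(self, num_nodes, edges):
        # edges: iterable of (src, dst, timestamp) tuples
        buckets = [[] for _ in range(num_nodes)]
        for src, dst, ts in edges:
            buckets[src].append((ts, dst))

        self.indptr = [0]   # indptr[v]..indptr[v + 1] spans node v's edges
        self.indices = []   # destination node of each edge
        self.times = []     # timestamp of each edge, ascending per node
        for bucket in buckets:
            bucket.sort()
            for ts, dst in bucket:
                self.times.append(ts)
                self.indices.append(dst)
            self.indptr.append(len(self.indices))

    def neighbors_before(self, node, t):
        """Return (neighbor, timestamp) pairs with timestamp strictly below t."""
        lo, hi = self.indptr[node], self.indptr[node + 1]
        # Binary search inside this node's time-sorted slice.
        end = bisect_left(self.times, t, lo, hi)
        return list(zip(self.indices[lo:end], self.times[lo:end]))

    def sample_recent(self, node, t, k):
        """Return up to k most recent neighbors of node before time t."""
        if k <= 0:
            return []
        return self.neighbors_before(node, t)[-k:]


edges = [(0, 1, 1.0), (0, 2, 3.0), (0, 3, 2.0), (1, 0, 5.0)]
csr = TemporalCSR(num_nodes=4, edges=edges)
recent = csr.sample_recent(0, t=10.0, k=2)
```

Because each node's edges are contiguous and sorted by timestamp, the set of
neighbors valid at a query time is a prefix of that node's slice, which is why
a single binary search is enough in this sketch.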

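This paper proposes TGL, a unified framework for large-scale offline temporal
graph neural network training on multiple GPUs. The framework comprises five
main components: a temporal sampler, a mailbox, a node memory module, a memory
updater, and a message passing engine. It uses a random chunk scheduling
technique to mitigate obsolete node memory when training with large batch
sizes. Experiments on several small-scale datasets and two large-scale datasets
show that TGL trains faster while achieving similar or better accuracy.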
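
Random chunk scheduling can be pictured roughly as follows. This sketch assumes
that the time-ordered edge list is split into fixed-size chunks and that each
epoch starts its batches at a randomly chosen chunk boundary, so that batch
boundaries shift from epoch to epoch. The function name, its parameters, and
the handling of edges before the offset (skipped here) are illustrative
assumptions, not details stated in the abstract.

```python
import random


def chunk_scheduled_batches(num_edges, batch_size, chunk_size, rng=random):
    """Return (start, end) edge ranges for one epoch.

    Hypothetical sketch: the first batch starts at a random multiple of
    chunk_size below batch_size, and later batches follow contiguously.
    """
    if chunk_size <= 0 or batch_size % chunk_size != 0:
        raise ValueError("batch_size must be a positive multiple of chunk_size")

    chunks_per_batch = batch_size // chunk_size
    # Random starting offset, measured in whole chunks.
    start = rng.randrange(chunks_per_batch) * chunk_size

    batches = []
    while start < num_edges:
        end = min(start + batch_size, num_edges)
        batches.append((start, end))
        start = end
    return batches


# Two epochs over the same edges may use different batch boundaries.
rng = random.Random(0)
epoch_a = chunk_scheduled_batches(1000, batch_size=200, chunk_size=50, rng=rng)
epoch_b = chunk_scheduled_batches(1000, batch_size=200, chunk_size=50, rng=rng)
```

The intent of shifting the boundaries is that two adjacent chunks do not always
fall in the same batch, so events that share a batch in one epoch can be
separated in another.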

TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs

Training modern deep learning models requires large amounts of computation,
often provided by GPUs. Scaling computation from one GPU to many can enable
much faster training and research progress but entails two complications.
First, the training library must support inter-GPU communication. Depending on
the particular methods employed, this communication may entail anywhere from
negligible to significant overhead. Second, the user must modify their
training code to take advantage of inter-GPU communication. Depending on the
training library's API, the modification required may be either significant or
minimal.
Existing methods for enabling multi-GPU training under the TensorFlow library
entail non-negligible communication overhead and require users to heavily
modify their model-building code, leading many researchers to avoid the whole
mess and stick with slower single-GPU training. In this paper we introduce
Horovod, an open source library that addresses both obstacles to scaling:
it employs efficient inter-GPU communication via ring reduction and requires
only a few lines of modification to user code, enabling faster, easier
distributed training in TensorFlow. Horovod is available under the Apache 2.0
license.

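This paper introduces Horovod, an open source library that provides efficient
inter-GPU communication via ring reduction and requires only minor changes to
user code, enabling faster, easier distributed training in TensorFlow.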
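
The ring reduction idea can be illustrated with a small single-process
simulation. This is a sketch of the general ring allreduce pattern (a
scatter-reduce phase followed by an allgather phase), not Horovod's code: the
function name `ring_allreduce` and the use of Python lists in place of GPU
buffers are assumptions made for illustration.

```python
def ring_allreduce(worker_data):
    """Simulate ring allreduce (sum) over equal-length per-worker vectors.

    Returns one buffer per worker; every buffer holds the elementwise sum.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [c * length // n for c in range(n + 1)]
    bufs = [list(v) for v in worker_data]

    def chunk(c):
        return slice(bounds[c], bounds[c + 1])

    # Phase 1, scatter-reduce: at each step worker r sends one chunk to
    # worker r + 1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all payloads first so the sends are effectively simultaneous.
        sends = [((r - step) % n, bufs[r][chunk((r - step) % n)])
                 for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            bufs[dst][chunk(c)] = [a + b
                                   for a, b in zip(bufs[dst][chunk(c)], payload)]

    # After phase 1, worker r holds the fully reduced chunk (r + 1) % n.
    # Phase 2, allgather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, bufs[r][chunk((r + 1 - step) % n)])
                 for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            bufs[(r + 1) % n][chunk(c)] = payload

    return bufs


grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
reduced = ring_allreduce(grads)
```

In this pattern each worker exchanges data only with its ring neighbors, and
each step moves roughly one n-th of the vector per worker, which is the
property that makes the approach attractive for exchanging gradients.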