Graph Transformers, which incorporate self-attention and positional encoding, have recently emerged as a powerful architecture for various graph learning tasks. Despite their impressive performance, the complex non-convex interactions across layers and the recursive graph structure have made it challenging to establish a theoretical foundation for learning and generalization. This study introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised node classification, comprising a self-attention layer with relative positional encoding and a two-layer perceptron. Focusing on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant, we characterize the sample complexity required to achieve a desirable generalization error by training with stochastic gradient descent (SGD). This paper provides the quantitative characterization of the sample complexity and number of iterations for convergence dependent on the fraction of discriminative nodes, the dominant patterns, and the initial model errors. Furthermore, we demonstrate that self-attention and positional encoding enhance generalization by making the attention map sparse and promoting the core neighborhood during training, which explains the superior feature representation of Graph Transformers. Our theoretical results are supported by empirical experiments on synthetic and real-world benchmarks.

该研究通过理论探索首次分析了浅层图变换器在半监督节点分类中的应用。它使用了自注意力和位置编码，并描述了实现理想的泛化误差所需的样本复杂度和迭代次数的定量特征。此外，文中还展示了自注意力和位置编码如何通过稀疏化注意力图和在训练过程中促进核心邻域，从而增强了图变换器的特征表示能力。实验证明了我们的理论结果。

图转换器泛化能力的提升方法：关注力机制和位置编码的理论探讨