We argue that Transformers are essentially graph-to-graph models, with sequences just being a special case. Attention weights are functionally equivalent to graph edges. Our Graph-to-Graph Transformer architecture makes this ability explicit, by inputting graph edges into the attention weight computations and predicting graph edges with attention-like functions, thereby integrating explicit graphs into the latent graphs learned by pretrained Transformers. Adding iterative graph refinement provides a joint embedding of input, output, and latent graphs, allowing non-autoregressive graph prediction to optimise the complete graph without any bespoke pipeline or decoding strategy. Empirical results show that this architecture achieves state-of-the-art accuracies for modelling a variety of linguistic structures, integrating very effectively with the latent linguistic representations learned by pretraining.
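
As an illustration of the mechanism described above, the following is a minimal sketch, not the authors' implementation. It assumes that each input edge label is embedded as a vector and added to the key before the dot product, that edge prediction reuses the same attention-like score (each token selects its best-scoring other token), and that refinement simply feeds the predicted graph back in as input until it stops changing. All function names, the integer edge encoding, and the random "parameters" are hypothetical.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(v, w):
    """Row vector v (length d_in) times matrix w (d_in x d_out)."""
    return [sum(v[k] * w[k][j] for k in range(len(v))) for j in range(len(w[0]))]


def edge_scores(x, edges, wq, wk, edge_emb):
    """Scaled dot-product scores with an edge-label embedding added to each key.

    edges[i][j] is an integer label for the input edge between tokens i and j
    (0 means "no edge"); this encoding is an assumption of the sketch.
    """
    d = len(wq[0])
    q = [project(xi, wq) for xi in x]
    k = [project(xi, wk) for xi in x]
    scores = []
    for i in range(len(x)):
        row = []
        for j in range(len(x)):
            key = [k[j][t] + edge_emb[edges[i][j]][t] for t in range(d)]
            row.append(sum(q[i][t] * key[t] for t in range(d)) / math.sqrt(d))
        scores.append(row)
    return scores


def softmax_rows(scores):
    """Turn each row of scores into attention weights that sum to 1."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def predict_edges(scores):
    """Attention-like edge prediction: each token picks its best-scoring other token."""
    n = len(scores)
    edges = [[0] * n for _ in range(n)]
    for i in range(n):
        best = max((j for j in range(n) if j != i), key=lambda j: scores[i][j])
        edges[i][best] = 1
    return edges


def refine(x, wq, wk, edge_emb, max_steps=3):
    """Feed the predicted graph back in as input until it stops changing."""
    n = len(x)
    edges = [[0] * n for _ in range(n)]  # start from the empty graph
    for _ in range(max_steps):
        new_edges = predict_edges(edge_scores(x, edges, wq, wk, edge_emb))
        if new_edges == edges:
            break
        edges = new_edges
    return edges


# Usage: 4 tokens with 8-dimensional embeddings and one edge label besides "no edge".
rng = random.Random(0)
x = rand_matrix(4, 8, rng)
wq, wk = rand_matrix(8, 8, rng), rand_matrix(8, 8, rng)
edge_emb = [[0.0] * 8, [rng.uniform(-1.0, 1.0) for _ in range(8)]]
attention = softmax_rows(edge_scores(x, [[0] * 4 for _ in range(4)], wq, wk, edge_emb))
graph = refine(x, wq, wk, edge_emb)
```

Because the whole graph is predicted at once and then re-encoded, each refinement step conditions on every edge of the previous prediction, which is the sense in which the prediction is non-autoregressive. A trained model would use learned projections, multiple heads and layers, and edge labels rather than the single unlabeled choice per token used here.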

Transformers as Graph-to-Graph Models