Transformer-based pre-trained models have advanced considerably in recent
years, becoming one of the most important backbones in natural language
processing. Recent work shows that the attention mechanism inside the
Transformer may not be necessary: both convolutional neural networks and
multi-layer perceptron based models have also been investigated as Transformer
alternatives. In this paper, we consider a graph recurrent network for language
model pre-training, which builds a graph structure for each sequence with local
token-level communications, together with a sentence-level representation
decoupled from other tokens. The original model performs well in
domain-specific text classification under supervised training; however, its
potential for learning transferable knowledge in a self-supervised way has not
been fully exploited. We fill this gap by optimizing the architecture and
verifying its effectiveness on more general language understanding tasks, in
both English and Chinese. As for model efficiency, instead of the quadratic
complexity of Transformer-based models, our model has linear complexity with
respect to sequence length and is more efficient during inference. Moreover, we find
that our model can generate more diverse outputs with less contextualized
feature redundancy than existing attention-based models.
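
The graph structure described above (local token-level communication plus a
separate sentence-level state) can be illustrated with a deliberately
simplified sketch. It is not the architecture from the paper: the learned
gated recurrent updates are replaced here by plain averaging, and the names
`graph_step` and `window` are hypothetical. The point is only to show why one
update costs time linear in the sequence length, since each token reads a
fixed-size neighbourhood and the sentence state reads each token once.

```python
def graph_step(tokens, sentence, window=1):
    """One synchronous update of all token states and the sentence state.

    Assumption: averaging stands in for the learned, gated update that a
    real graph recurrent network would use.
    """
    n = len(tokens)
    dim = len(sentence)
    new_tokens = []
    for i in range(n):
        # Local context: the token itself and its neighbours within `window`,
        # plus the sentence-level state (at most 2 * window + 2 vectors).
        lo, hi = max(0, i - window), min(n, i + window + 1)
        context = tokens[lo:hi] + [sentence]
        new_tokens.append(
            [sum(vec[k] for vec in context) / len(context) for k in range(dim)]
        )
    # Sentence state: aggregates every token once, kept apart from the tokens.
    new_sentence = [
        (sentence[k] + sum(tok[k] for tok in tokens)) / (n + 1) for k in range(dim)
    ]
    return new_tokens, new_sentence


# Toy input: three 2-dimensional token states and a zero sentence state.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence = [0.0, 0.0]
for _ in range(2):  # information spreads one hop further per step
    tokens, sentence = graph_step(tokens, sentence)
```

Stacking several such steps lets distant tokens exchange information, both
through chains of neighbours and through the shared sentence state, without
ever forming the full token-by-token score matrix that self-attention needs.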

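This work proposes a language model pre-training method based on a graph recurrent network. It outperforms the attention-based Transformer in performance, efficiency, and generation diversity, and shows strong potential for self-supervised learning.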

Pre-Training a Graph Recurrent Network for Language Representation

In this work, we investigate the positional encoding methods used in language
pre-training (e.g., BERT) and identify several problems in the existing
formulations. First, we show that in absolute positional encoding, adding
positional embeddings to word embeddings introduces mixed correlations between
the two heterogeneous sources of information. This may bring unnecessary
randomness into the attention and further limit the expressiveness of the
model. Second, we question whether treating the position of the symbol
\texttt{[CLS]} the same as other words is a reasonable design, considering its
special role (the representation of the entire sentence) in the downstream
tasks. Motivated by the above analysis, we propose a new positional
encoding method called \textbf{T}ransformer with \textbf{U}ntied
\textbf{P}ositional \textbf{E}ncoding (TUPE). In the self-attention module,
TUPE computes the word contextual correlation and positional correlation
separately with different parameterizations and then adds them together. This
design removes the mixed and noisy correlations over heterogeneous embeddings
and offers more expressiveness by using different projection matrices.
Furthermore, TUPE unties the \texttt{[CLS]} symbol from other positions, making
it easier to capture information from all positions. Extensive experiments and
ablation studies on the GLUE benchmark demonstrate the effectiveness of the
proposed method. Code and models are released at
this https URL
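
The untied computation described above can be sketched in a few lines of
plain Python. This is a minimal illustration, not the released code: the
function and parameter names are hypothetical, the `1 / sqrt(2d)` scaling is
an assumption made because two score terms are summed, and the two scalars
used to untie `[CLS]` (`theta_from_cls`, `theta_to_cls`) are one assumed way
of detaching position 0 from the positional term.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def untied_attention_weights(words, positions, w_q, w_k, u_q, u_k,
                             theta_from_cls, theta_to_cls):
    """Attention weights from separately computed word and position scores."""
    n, d = len(words), len(w_q[0])
    scale = 1.0 / math.sqrt(2 * d)  # assumed scaling for the sum of two terms

    # Word-to-word correlation, using its own projections (w_q, w_k).
    word_scores = matmul(matmul(words, w_q), transpose(matmul(words, w_k)))
    # Position-to-position correlation, using separate projections (u_q, u_k).
    pos_scores = matmul(matmul(positions, u_q), transpose(matmul(positions, u_k)))
    pos_scores = [[scale * v for v in row] for row in pos_scores]

    # Untie [CLS] (assumed to sit at index 0): positional scores that involve
    # it are replaced by two scalars instead of position-dependent values.
    for j in range(n):
        pos_scores[0][j] = theta_from_cls
    for i in range(1, n):
        pos_scores[i][0] = theta_to_cls

    logits = [[scale * w + p for w, p in zip(w_row, p_row)]
              for w_row, p_row in zip(word_scores, pos_scores)]
    return [softmax(row) for row in logits]


# Toy example: 4 tokens ([CLS] first), embedding size 8, random parameters.
rng = random.Random(0)
n, d = 4, 8
words = random_matrix(n, d, rng)
positions = random_matrix(n, d, rng)
w_q, w_k, u_q, u_k = (random_matrix(d, d, rng) for _ in range(4))
weights = untied_attention_weights(words, positions, w_q, w_k, u_q, u_k, 0.0, 0.0)
```

In contrast to adding the two embeddings before a single projection, no term
here multiplies a word embedding by a positional embedding, which is the mixed
correlation the abstract argues against. In this sketch, changing the
positional embedding at index 0 also leaves the attention weights unchanged,
because every positional score involving `[CLS]` comes from the two scalars.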

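We propose TUPE, a new positional encoding method that computes the contextual correlation between words and the positional correlation separately, using different projection matrices, and then adds them together, removing the mixed and noisy correlations. Extensive experiments and ablation studies demonstrate the effectiveness of the method.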