We prove that, with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention's internal mechanism. This leads to our key insight: self-attention can approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these results, we show that two-layer multi-head attention alone suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that (softmax-)attention-only layers can approximate various statistical models in-context. We believe these techniques hold independent interest.
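The idea that softmax attention can mimic ReLU can be seen in a toy scalar setting. The sketch below is only an illustration under stated assumptions and is not the interpolation-based construction of the paper: a single query attends over two keys whose scores are 0 and beta * x, with values 0 and x, so the output is x * sigmoid(beta * x), which tends to ReLU(x) as beta grows. The names `attention_relu`, `max_error`, and the temperature parameter `beta` are hypothetical and chosen for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_relu(x, beta):
    """Toy single-query attention (an assumption for illustration).

    Scores are [0, beta * x] and values are [0, x], so the output equals
    x * sigmoid(beta * x), a smooth surrogate for ReLU(x).
    """
    weights = softmax([0.0, beta * x])
    values = [0.0, x]
    return sum(w * v for w, v in zip(weights, values))


def relu(x):
    return max(0.0, x)


def max_error(beta, lo=-3.0, hi=3.0, steps=601):
    """Largest gap between the toy attention output and ReLU on a grid."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(attention_relu(x, beta) - relu(x)) for x in grid)


if __name__ == "__main__":
    for beta in (1.0, 10.0, 100.0):
        print(f"beta={beta:6.1f}  max |attn - relu| = {max_error(beta):.5f}")
```

In this toy case the worst-case gap shrinks in proportion to 1/beta, which is one simple way to read "to arbitrary precision": sharpening the softmax drives the attention output toward the piecewise-linear target.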

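This work addresses the universal approximation problem for self-attention models in sequence-to-sequence mapping. By introducing a new interpolation-based method, it proves that both two-layer self-attention and one-layer self-attention followed by a softmax function can approximate any continuous sequence-to-sequence function on a compact domain. The results show that two-layer multi-head attention alone suffices for sequence-to-sequence universal approximation, and that the same techniques can be used to approximate various statistical models in-context.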

Universal Approximation of Softmax Attention