We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input domain-partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights so that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, can approximate any continuous function on a compact domain in the $L_\infty$-norm. Furthermore, we extend this construction to approximate any Lebesgue integrable function in the $L_p$-norm for $1\leq p <\infty$. Lastly, we extend our techniques to show, for the first time, that single-head cross-attention achieves the same universal approximation guarantees.
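The domain-partition reading of attention can be illustrated with a small numerical sketch. This is not the paper's construction. It is a one-dimensional toy under our own assumptions: the target `sin(2*pi*x)` on `[0, 1]`, evenly spaced cell centers, the affine scores `c*x - c*c/2` (whose maximizer is the nearest center), and the inverse temperature `beta` are all hypothetical choices made for illustration. A softmax over these scores with large `beta` approaches a hard cell assignment, and the attention output then returns the value attached to the cell containing `x`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_approx(x, centers, values, beta):
    """Single-head attention read as a soft partition of the input domain.

    Each score is affine in x: beta * (c * x - c^2 / 2). The largest score
    belongs to the center nearest to x, so for large beta the softmax
    weights concentrate on the cell that contains x.
    """
    scores = [beta * (c * x - 0.5 * c * c) for c in centers]
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))


def target(x):
    """Illustrative continuous target on the compact domain [0, 1]."""
    return math.sin(2.0 * math.pi * x)


def sup_error(n_cells, beta, n_test=1001):
    """Largest absolute error on a uniform test grid over [0, 1]."""
    centers = [(i + 0.5) / n_cells for i in range(n_cells)]
    values = [target(c) for c in centers]  # one value per subregion
    xs = [j / (n_test - 1) for j in range(n_test)]
    return max(
        abs(attention_approx(x, centers, values, beta) - target(x)) for x in xs
    )


if __name__ == "__main__":
    for n_cells in (10, 50):
        print(n_cells, sup_error(n_cells, beta=1e4))
```

In this toy setting, refining the partition (more cells) while keeping the softmax sharp reduces the grid sup-norm error, which mirrors the $L_\infty$ statement only informally.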

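This work addresses the gap in the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms. Our key innovation is to interpret single-head attention as an input domain-partition mechanism: by engineering the attention weights so that the resulting assignment imitates the target function, we prove that it can approximate any continuous function on a compact domain, and we extend the result to any Lebesgue integrable function. This finding provides the same universal approximation guarantees for single-head cross-attention.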

Attention mechanisms, max-affine partitions, and universal approximation