This article presents a theoretical evaluation of the computational
universality of decoder-only transformer models. We extend the theoretical
literature on transformer models and show that decoder-only transformer models are computationally universal.
Through a comparative analysis of the traditional encoder-decoder architecture and the decoder-only language model architecture, this article reveals the attention degeneration problem inherent in decoder-only language models, and proposes a partial attention language model to address it.