BriefGPT.xyz
Nov, 2024
Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
Tiberiu Musat
TL;DR
This paper introduces the retrieval problem, a simple reasoning task that can be solved only by transformers with a minimal number of layers. The study finds that large language models can solve the task under various prompt formulations without fine-tuning, that successful learning depends on the presence of an implicit curriculum, and that attention heads emerge in a specific order, thereby revealing the mechanism by which transformers solve the retrieval problem.
Abstract
In this paper, I introduce the Retrieval Problem, a simple reasoning task that can be solved only by Transformers with a minimum number of layers. The task has an adjustable difficulty that can further increase t…
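
The excerpt does not spell out the retrieval problem's input format. As a rough illustration of what an "adjustable difficulty" retrieval task could look like, here is a minimal sketch assuming a chained key-value lookup in which difficulty is the number of hops from the query to the answer; the function `make_retrieval_instance` and all of its parameters are hypothetical and not the paper's actual construction.

```python
import random

def make_retrieval_instance(num_pairs: int, num_hops: int, vocab: int = 100):
    """Build one instance of a hypothetical chained key-value retrieval task.

    The prompt lists key->value pairs in random order; answering the final
    query requires following `num_hops` lookups in a row. The intuition is
    that deeper chains demand more stacked attention layers.
    """
    tokens = random.sample(range(vocab), num_pairs + num_hops + 1)
    chain = tokens[: num_hops + 1]          # query -> ... -> answer
    distractors = tokens[num_hops + 1 :]

    # Pairs that form the chain: chain[i] is the key for chain[i + 1].
    pairs = [(chain[i], chain[i + 1]) for i in range(num_hops)]

    # Pad with unrelated pairs so the answer cannot be read off positionally.
    random.shuffle(distractors)
    for i in range(0, len(distractors) - 1, 2):
        pairs.append((distractors[i], distractors[i + 1]))
    random.shuffle(pairs)

    # Flatten the pairs and append the starting query token.
    prompt = [t for pair in pairs for t in pair] + [chain[0]]
    answer = chain[-1]
    return prompt, answer

prompt, answer = make_retrieval_instance(num_pairs=8, num_hops=3)
print(prompt, "->", answer)
```

Under this sketch, increasing `num_hops` lengthens the lookup chain, which matches the abstract's claim that the task's difficulty can be tuned to require more layers.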