BriefGPT.xyz
Feb, 2024
How Transformers Learn Causal Structure with Gradient Descent
Eshaan Nichani, Alex Damian, Jason D. Lee
TL;DR
When trained with gradient descent, the transformer learns the causal structure of the data: the first attention layer comes to encode the latent causal graph through its self-attention mechanism.
Abstract
The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-