In-context learning refers to a model's ability to learn at inference time
without updating its parameters. The input (i.e., the prompt) to the model
(e.g., a transformer) consists of both a context (i.e., instance-label pairs)
and a query instance. The model is then able to output a label for the query
instance according to the context during inference. A possible
explanation for in-context learning is that the forward pass of (linear)
transformers implements iterations of gradient descent on the instance-label
pairs in the context. In this paper, we prove by construction that transformers
can also implement temporal difference (TD) learning in the forward pass, a
phenomenon we refer to as in-context TD. We demonstrate that in-context TD
emerges after the transformer is trained with a multi-task TD algorithm, and
we accompany this demonstration with theoretical analysis. Furthermore, we prove that transformers
are expressive enough to implement many other policy evaluation algorithms in
the forward pass, including residual gradient, TD with eligibility traces, and
average-reward TD.
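
To make the kind of update concrete, the following is a minimal sketch of batch TD(0) with a linear value function, applied repeatedly to a fixed context of transitions. It is an illustration only and not the construction proved in the paper: the function name `batch_td_on_context`, the step size `lr`, the zero initialization, and the plain averaging over the context are all assumptions made here. The loose analogy is that each iteration plays the role of one layer of the forward pass.

```python
import numpy as np


def batch_td_on_context(phi, phi_next, rewards, gamma=0.9, lr=0.1, num_iters=100):
    """Hypothetical helper: run batch TD(0) updates on a fixed context.

    phi, phi_next: arrays of shape (n, d), features of each state and its successor.
    rewards: array of shape (n,), the reward observed on each transition.
    Returns the weight vector w of the linear value estimate v(s) = w @ phi(s).
    """
    phi = np.asarray(phi, dtype=float)
    phi_next = np.asarray(phi_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    n, d = phi.shape

    w = np.zeros(d)  # assumed initialization
    for _ in range(num_iters):
        # TD error for every transition in the context, shape (n,)
        td_error = rewards + gamma * (phi_next @ w) - (phi @ w)
        # Semi-gradient step: average of td_error_j * phi_j over the context
        w = w + lr * (phi.T @ td_error) / n
    return w


# Single self-looping state with reward 1 and discount 0.5.
w = batch_td_on_context([[1.0]], [[1.0]], [1.0], gamma=0.5, lr=0.5, num_iters=200)
print(w)
```

In this toy case the true value is 1 / (1 - 0.5) = 2, and the iterate contracts toward it by a factor of 0.75 per step. The residual gradient variant would differ only in the update direction, using `phi - gamma * phi_next` in place of `phi`.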

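This paper proves that transformers can implement temporal difference (TD) learning, along with many other policy evaluation algorithms, in the forward pass. The transformer is trained with a multi-task TD algorithm, and the results are accompanied by theoretical analysis.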