We posit that to achieve superhuman agents, future models require superhuman
feedback in order to provide an adequate training signal. Current approaches
commonly train reward models from human preferences, which has two drawbacks:
the reward models may be bottlenecked by human performance level, and, being
separate and frozen, they cannot learn to improve during LLM training. In this work,
we study Self-Rewarding Language Models, where the language model itself is
used via LLM-as-a-Judge prompting to provide its own rewards during training.
We show that during Iterative DPO training, not only does instruction-following
ability improve, but so does the model's ability to provide high-quality rewards
to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a
model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard,
including Claude 2, Gemini Pro, and GPT-4 0613. While this is only a preliminary
study, the work opens the door to models that can continually improve along both
axes: instruction following and reward modeling.
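
The abstract does not spell out how self-assigned rewards become training data for Iterative DPO. A minimal sketch, assuming that for each prompt the model samples several candidate responses, scores each one itself, and keeps the highest- and lowest-scored responses as the chosen/rejected pair. The names `build_preference_pairs`, `toy_generate`, and `toy_judge` are hypothetical, the `prompt`/`chosen`/`rejected` keys are an assumed data layout, and the toy functions merely stand in for sampling from the model and for LLM-as-a-Judge prompting:

```python
from typing import Callable, Dict, List


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],
    judge: Callable[[str, str], float],
    n_candidates: int = 4,
) -> List[Dict[str, str]]:
    """Turn self-judged candidate responses into (chosen, rejected) pairs."""
    pairs = []
    for prompt in prompts:
        candidates = generate(prompt, n_candidates)
        scored = [(judge(prompt, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda sc: sc[0])
        worst_score, worst = min(scored, key=lambda sc: sc[0])
        if best_score == worst_score:
            continue  # all candidates tied: no preference signal for this prompt
        pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs


# Toy stand-ins (assumptions): a real system would call the same language
# model for both generation and judging.
def toy_generate(prompt: str, n: int) -> List[str]:
    return [(prompt + " " + "detail " * i).strip() for i in range(n)]


def toy_judge(prompt: str, response: str) -> float:
    return float(len(response.split()))  # longer answer scores higher (toy rule)


if __name__ == "__main__":
    print(build_preference_pairs(["Explain DPO"], toy_generate, toy_judge, 3))
```

In an iterative setup, the pairs produced by one model would be used to train the next model, which then acts as both generator and judge in the following round.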

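Through Iterative DPO training of Self-Rewarding Language Models, this study shows improvements in both the model's instruction-following ability and its ability to provide high-quality rewards to itself. The resulting Llama 2 70B model outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. This preliminary study opens the door to models that can keep improving along both axes.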

Self-Rewarding Language Models

Language agents have shown some ability to interact with an external
environment, e.g., a virtual world such as ScienceWorld, to perform complex
tasks, e.g., growing a plant, without the startup costs of reinforcement
learning. However, despite their zero-shot capabilities, these agents to date
do not continually improve over time beyond performance refinement on a
specific task. Here we present CLIN, the first language-based agent to achieve
this: it continually improves over multiple trials, including when both the
environment and task are varied, without requiring parameter updates.
Our approach is to use a persistent, dynamic, textual memory centered on causal
abstractions (rather than general "helpful hints") that is regularly updated
after each trial so that the agent gradually learns useful knowledge for new
trials. In the ScienceWorld benchmark, CLIN is able to continually improve on
repeated trials on the same task and environment, outperforming
state-of-the-art reflective language agents like Reflexion by 23 absolute
points. CLIN can also transfer its learning to new environments (or new tasks),
improving its zero-shot performance by 4 points (13 for new tasks) and can
further improve performance there through continual memory updates, enhancing
performance by an additional 17 points (7 for new tasks). This suggests a new
architecture for agents built on frozen models that can still continually and
rapidly improve over time.
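
To make the idea of a persistent textual memory of causal abstractions concrete, here is a minimal sketch. It assumes each trial yields insights of the form (action, outcome, helped) from some external reflection step that is not shown, and that the memory is rendered as plain text for the next trial's prompt. The class `CausalMemory`, the insight format, and the sentence templates are all illustrative assumptions, not CLIN's actual implementation:

```python
from collections import Counter
from typing import Iterable, Tuple

# (action, outcome, helped): a hypothetical format for one causal insight.
Insight = Tuple[str, str, bool]


class CausalMemory:
    """Persistent textual memory that is updated after every trial."""

    def __init__(self) -> None:
        self._support: Counter = Counter()  # insight -> number of supporting trials

    def update(self, insights: Iterable[Insight]) -> None:
        """Merge the insights extracted from one finished trial."""
        for insight in insights:
            self._support[insight] += 1

    def render(self) -> str:
        """Produce the text that would be placed in the agent's next prompt."""
        lines = []
        for (action, outcome, helped), count in sorted(self._support.items()):
            relation = "may be necessary for" if helped else "may not contribute to"
            lines.append(f"{action} {relation} {outcome} (seen in {count} trial(s))")
        return "\n".join(lines)


if __name__ == "__main__":
    memory = CausalMemory()
    memory.update([("moving to the greenhouse", "finding a seed", True)])
    memory.update([
        ("moving to the greenhouse", "finding a seed", True),
        ("opening the fridge", "finding a seed", False),
    ])
    print(memory.render())
```

Because the memory is plain text and lives outside the model, the underlying model can stay frozen while the agent's behavior still changes from trial to trial.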

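CLIN is the first language-based agent that, through a continually updated textual memory, keeps improving its performance and transfers what it has learned across varying environments and tasks, so that its performance rises steadily over time.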

CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization

Dialogue systems, commonly known as chatbots, have become increasingly popular
in recent years owing to their widespread use in chit-chat conversations with
users and in task-oriented dialogues that accomplish various user tasks.
Existing chatbots are usually trained from
pre-collected and manually-labeled data and/or written with handcrafted rules.
Many also use manually-compiled knowledge bases (KBs). Their ability to
understand natural language is still limited, and they tend to produce many
errors resulting in poor user satisfaction. Typically, they need to be
constantly improved by engineers with more labeled data and more manually
compiled knowledge. This book introduces the new paradigm of lifelong learning
dialogue systems, which endows chatbots with the ability to learn continually
and improve by themselves through self-initiated interactions with their users
and working environments. As the systems chat more and more
with users or learn more and more from external sources, they become more and
more knowledgeable and better and better at conversing. The book presents the
latest developments and techniques for building such continual learning
dialogue systems, which continuously learn new language expressions and lexical
and factual knowledge from users during conversation and from external sources
outside of conversation, acquire new training examples during conversation, and learn
conversational skills. Apart from these general topics, existing works on
continual learning of some specific aspects of dialogue systems are also
surveyed. The book concludes with a discussion of open challenges for future
research.

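This book introduces a new learning paradigm for dialogue systems in which the system learns through its own interactions with users and its environment, continually improving by acquiring language expressions, lexical and factual knowledge, training examples, and conversational skills from users and external sources. Beyond this general treatment, the book also surveys continual learning methods for specific aspects of dialogue systems and discusses open challenges for future research.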

Lifelong and Continual Learning Dialogue Systems

Self-replication is a key aspect of biological life that has been largely
overlooked in Artificial Intelligence systems. Here we describe how to build
and train self-replicating neural networks. The network replicates itself by
learning to output its own weights. The network is designed using a loss
function that can be optimized with either gradient-based or non-gradient-based
methods. We also describe a method we call regeneration to train the network
without explicit optimization, by injecting the network with predictions of its
own parameters. The best solution for a self-replicating network was found by
alternating between regeneration and optimization steps. Finally, we describe a
design for a self-replicating neural network that can solve an auxiliary task
such as MNIST image classification. We observe that there is a trade-off
between the network's ability to classify images and its ability to replicate,
but training is biased towards increasing its specialization at image
classification at the expense of replication. This is analogous to the
trade-off between reproduction and other tasks observed in nature. We suggest
that a self-replication mechanism for artificial intelligence is useful because
it introduces the possibility of continual improvement through natural
selection.
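
A minimal NumPy sketch of the two ideas named above, the self-replication loss and the regeneration step. It assumes the network is queried once per weight index through a fixed, untrained random embedding and returns a prediction of that weight; the embedding, the layer sizes, and the function names are illustrative assumptions rather than the authors' exact design, and the optimization step is left out:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 6                        # embedding and hidden sizes (arbitrary choices)
P = D * H + H                      # total number of trainable weights
EMBED = rng.normal(size=(P, D))    # fixed code for each weight index (assumption)


def predict_weights(theta: np.ndarray) -> np.ndarray:
    """Run the network defined by `theta` on every weight index at once."""
    w1 = theta[: D * H].reshape(D, H)   # hidden-layer weights
    w2 = theta[D * H:]                  # output-layer weights
    hidden = np.tanh(EMBED @ w1)        # shape (P, H)
    return hidden @ w2                  # shape (P,): one prediction per weight


def replication_loss(theta: np.ndarray) -> float:
    """Sum of squared differences between predicted and actual weights."""
    return float(np.sum((predict_weights(theta) - theta) ** 2))


def regenerate(theta: np.ndarray) -> np.ndarray:
    """Regeneration: replace the weights with the network's own predictions."""
    return predict_weights(theta)


if __name__ == "__main__":
    theta = rng.normal(scale=0.1, size=P)
    print("loss before regeneration:", replication_loss(theta))
    theta = regenerate(theta)
    print("loss after regeneration: ", replication_loss(theta))
```

Note that the all-zero weight vector replicates itself perfectly under this loss, so a low loss alone does not indicate an interesting solution. Alternating `regenerate` with steps of any optimizer on `replication_loss` would mirror the alternating scheme the abstract reports as giving the best result.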

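This paper describes how to build and train self-replicating neural networks, in which the network replicates itself by learning to output its own weights, and introduces a training method called regeneration. The best self-replicating network was found by alternating between regeneration and optimization steps, and a variant of the design can also solve an auxiliary task such as MNIST image classification. The paper further argues that a self-replication mechanism is useful for artificial intelligence because it introduces the possibility of continual improvement through natural selection.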