The standard Reinforcement Learning from Human Feedback (RLHF) framework
primarily focuses on optimizing the performance of large language models using
pre-collected prompts. However, collecting prompts that provide comprehensive
coverage is both tedious and challenging, and often fails to include scenarios
that LLMs need to improve on the most. In this paper, we investigate alignment
through the lens of two-agent games, involving iterative interactions between
an adversarial and a defensive agent. The adversarial agent's task at each step
is to generate prompts that expose the weaknesses of the defensive agent. In
turn, the defensive agent seeks to improve its responses to the newly
identified prompts it struggled with, based on feedback from the reward model.
We theoretically demonstrate that this iterative reinforcement learning
optimization converges to a Nash Equilibrium for the game induced by the
agents. Experimental results in safety scenarios demonstrate that learning in
such a competitive environment not only trains both agents thoroughly but also
yields policies with enhanced generalization capabilities for both adversarial and
defensive agents.
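
The alternating adversarial/defensive updates described above can be
illustrated with a toy zero-sum matrix game. This is only a sketch under stated
assumptions, not the paper's method: the 2x2 reward matrix is hypothetical, and
multiplicative-weights (Hedge) updates stand in for the reinforcement learning
optimization. In a zero-sum game, the time-averaged strategies of two such
learners approach a Nash equilibrium.

```python
import math

# Hypothetical reward matrix (an assumption for illustration):
# REWARD[i][j] is the defender's reward when the adversary picks
# prompt type i and the defender uses response strategy j.
REWARD = [[0.9, 0.2],
          [0.3, 0.8]]


def softmax(scores):
    """Turn cumulative scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_rewards(adv, dfn):
    """Defender reward per prompt type (vs. dfn) and per strategy (vs. adv)."""
    per_prompt = [sum(REWARD[i][j] * dfn[j] for j in range(2)) for i in range(2)]
    per_strategy = [sum(adv[i] * REWARD[i][j] for i in range(2)) for j in range(2)]
    return per_prompt, per_strategy


def play(rounds=5000, eta=0.1):
    """Run simultaneous Hedge updates; return the time-averaged strategies."""
    adv_scores, dfn_scores = [0.0, 0.0], [0.0, 0.0]
    adv_sum, dfn_sum = [0.0, 0.0], [0.0, 0.0]
    for _ in range(rounds):
        adv, dfn = softmax(adv_scores), softmax(dfn_scores)
        per_prompt, per_strategy = expected_rewards(adv, dfn)
        for i in range(2):
            # The adversary shifts weight toward prompts with low defender reward.
            adv_scores[i] -= eta * per_prompt[i]
            adv_sum[i] += adv[i]
        for j in range(2):
            # The defender shifts weight toward strategies with high reward.
            dfn_scores[j] += eta * per_strategy[j]
            dfn_sum[j] += dfn[j]
    return [x / rounds for x in adv_sum], [x / rounds for x in dfn_sum]


def exploitability(adv, dfn):
    """Total gain available to best responses; zero exactly at a Nash equilibrium."""
    per_prompt, per_strategy = expected_rewards(adv, dfn)
    return max(per_strategy) - min(per_prompt)
```

For this assumed matrix the mixed equilibrium is (5/12, 7/12) for the adversary
and (1/2, 1/2) for the defender, with game value 0.55. Calling
`exploitability(*play())` gives a small value, meaning neither toy agent can
gain much by deviating from its averaged strategy.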

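Through iterative interaction between two agents, in which an adversarial agent
generates prompts that expose the defensive agent's weaknesses and the defensive
agent improves its responses based on reward-model feedback, this paper proves
theoretically that the iterative reinforcement learning optimization converges
to a Nash equilibrium of the game induced by the agents. Experiments in safety
scenarios show that learning in such a competitive environment not only trains
the agents thoroughly but also improves the generalization of both the
adversarial and defensive agents.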

Toward Optimal LLM Alignments Using Two-Player Games

With the widespread adoption of Large Language Models (LLMs), the prevalence
of iterative interactions among these models is anticipated to increase.
Notably, recent advancements in multi-round self-improving methods allow LLMs
to generate new examples for training subsequent models. At the same time,
multi-agent LLM systems, involving automated interactions among agents, are
also increasing in prominence. Thus, in both the short and long term, LLMs may
actively engage in an evolutionary process. We draw parallels between the
behavior of LLMs and the evolution of human culture, as the latter has been
extensively studied by cognitive scientists for decades. Our approach involves
leveraging Iterated Learning (IL), a Bayesian framework that elucidates how
subtle biases are magnified during human cultural evolution, to explain some
behaviors of LLMs. This paper outlines key characteristics of agents' behavior
in the Bayesian-IL framework, including predictions that are supported by
experimental verification with various LLMs. This theoretical framework could
help to more effectively predict and guide the evolution of LLMs in desired
directions.
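
A minimal sketch of a Bayesian iterated-learning chain may help make the idea
concrete. Everything specific here is an assumption for illustration rather
than the paper's setup: two hypotheses (coins with different biases), a mildly
uneven prior, and learners that sample a hypothesis from their posterior after
seeing `n` observations produced by the previous generation. For such
posterior-sampling learners the distribution over hypotheses drifts to the
prior, whatever the first generation believed.

```python
from math import comb

# Illustrative assumptions: two hypotheses about a coin and a mild prior bias.
PRIOR = [0.6, 0.4]   # learner's prior over hypotheses h = 0, 1
P_ONE = [0.7, 0.3]   # probability of observing a 1 under each hypothesis


def likelihood(k, n, h):
    """Probability of seeing k ones in n observations under hypothesis h."""
    p = P_ONE[h]
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior(k, n):
    """Posterior over hypotheses after observing k ones in n samples."""
    joint = [PRIOR[h] * likelihood(k, n, h) for h in range(2)]
    total = sum(joint)
    return [x / total for x in joint]


def transition(n):
    """T[h][h2]: chance a learner adopts h2 when its teacher holds h."""
    T = [[0.0, 0.0], [0.0, 0.0]]
    for h in range(2):
        for k in range(n + 1):
            post = posterior(k, n)
            for h2 in range(2):
                T[h][h2] += likelihood(k, n, h) * post[h2]
    return T


def evolve(start, generations=50, n=3):
    """Push a distribution over hypotheses through successive generations."""
    T = transition(n)
    dist = list(start)
    for _ in range(generations):
        dist = [sum(dist[h] * T[h][h2] for h in range(2)) for h2 in range(2)]
    return dist
```

Starting from `evolve([0.0, 1.0])` or `evolve([1.0, 0.0])`, the distribution
ends up close to `PRIOR`, which is one concrete sense in which the learners'
built-in bias, rather than the initial data, shapes the long-run outcome.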

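This paper discusses iterative interactions among large language models (LLMs)
and draws parallels between multi-agent LLM systems and human cultural
evolution. It uses Iterated Learning (IL), a Bayesian framework, to explain
some behavioral characteristics of LLMs and confirms the framework's predictions
experimentally, which may help to predict and guide the evolution of LLMs in
desired directions more effectively.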

Language Model Evolution: An Iterated Learning Perspective

Generating images with a Text-to-Image model often requires multiple trials,
where human users iteratively update their prompt based on feedback, namely the
output image. Taking inspiration from cognitive work on reference games and
dialogue alignment, this paper analyzes the dynamics of the user prompts along
such iterations. We compile a dataset of iterative interactions of human users
with Midjourney. Our analysis then reveals that prompts predictably converge
toward specific traits along these iterations. We further study whether this
convergence is due to human users realizing they missed important details or
due to adaptation to the model's "preferences", that is, to a specific language
style for which the model produces better images. We show initial evidence that
both possibilities are
at play. The possibility that users adapt to the model's preference raises
concerns about reusing user data for further training. The prompts may be
biased towards the preferences of a specific model, rather than aligned with
human intentions and a natural manner of expression.

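By studying users' iterative interactions with a Text-to-Image model, this work
analyzes the dynamics of user prompts and finds that prompts converge toward
specific traits over the iterations. Further study suggests that this
convergence may arise both from users revising prompts after realizing they
missed important details and from users adapting to the model's preferences,
that is, to a language style for which the model produces better images.
Initial evidence indicates that both possibilities are at play. Users'
adaptation to the model's preferences raises concerns about reusing user data
for further training, since the prompts may be biased towards a specific
model's preferences rather than aligned with human intentions and natural ways
of expression.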