Recent research has shown the potential of Nash Learning via Human Feedback
for large language model alignment by incorporating the notion of a preference
model in a minimax game setup. We take this idea further by casting
alignment as a mirror descent algorithm against the adaptive feedback of an
improved opponent, thereby removing the need to learn a preference model or
to have an annotated dataset at all. The resulting algorithm,
which we refer to as Language Alignment via Nash-learning and Adaptive feedback
(LANA), is capable of self-alignment without the need for a human-annotated
preference dataset. We support this claim with experiments and
mathematical discussion.
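
To make the mirror descent idea concrete, here is a minimal tabular sketch of self-play with KL regularization toward a reference policy. It is not the LANA algorithm itself: the function name `mirror_descent_self_play`, the preference matrix `P`, the reference policy `ref`, and the parameters `tau`, `eta`, and `steps` are all illustrative assumptions, and a fixed preference table stands in for the adaptive feedback described above.

```python
import math

def mirror_descent_self_play(P, ref, tau=0.1, eta=0.2, steps=2000):
    """Entropic mirror descent on a small preference game (illustrative only).

    P[i][j] is the assumed probability that response i is preferred to
    response j; ref is the reference policy; tau is the KL-regularization
    strength; eta is the step size.
    """
    n = len(ref)
    pi = list(ref)  # start from the reference policy
    for _ in range(steps):
        # Win rate of each response against the current policy (the opponent).
        win = [sum(P[i][j] * pi[j] for j in range(n)) for i in range(n)]
        # Multiplicative update in log space, pulled toward ref by eta * tau.
        logits = [
            (1.0 - eta * tau) * math.log(pi[i])
            + eta * tau * math.log(ref[i])
            + eta * win[i]
            for i in range(n)
        ]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        pi = [w / total for w in weights]
    return pi

# Toy example: response 0 is preferred to both alternatives 90% of the time.
P = [[0.5, 0.9, 0.9],
     [0.1, 0.5, 0.7],
     [0.1, 0.3, 0.5]]
ref = [1.0 / 3.0] * 3
pi = mirror_descent_self_play(P, ref)
print(pi)
```

In this toy game the returned policy puts most of its mass on response 0, and at convergence it satisfies the fixed-point condition that `pi[i]` is proportional to `ref[i] * exp(win[i] / tau)`.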

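The Language Alignment via Nash-learning and Adaptive feedback (LANA) algorithm removes the need to learn a preference model or to have an annotated dataset, enabling large language models to align themselves.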

Language Alignment via Nash-learning and Adaptive feedback

Reinforcement Learning from Human Feedback (RLHF) learns from the preference
signal provided by a probabilistic preference model, which takes a prompt and
two responses as input, and produces a score indicating the preference for one
response over another. So far, the most popular RLHF paradigm is
reward-based, which starts with an initial step of reward modeling; the
constructed reward then provides the reward signal for the subsequent
reward optimization stage. However, the existence of a reward function is a
strong assumption, and reward-based RLHF is limited in expressivity and
cannot capture complicated real-world human preferences.
In this work, we provide theoretical insights for a recently proposed
learning paradigm, Nash learning from human feedback (NLHF), which considers a
general preference model and formulates the alignment process as a game between
two competing LLMs. The learning objective is to find a policy that
consistently generates responses preferred over any competing policy while
staying close to the initial model. The objective is defined as the Nash
equilibrium (NE) of the KL-regularized preference model. We make a first
attempt to study the theoretical learnability of the KL-regularized NLHF
by considering both offline and online settings. For offline learning from
a pre-collected dataset, we propose algorithms that are efficient under
suitable coverage conditions on the dataset. For batch online learning from
iterative interactions with a preference oracle, our proposed algorithm enjoys
a finite-sample guarantee under a structural condition on the underlying
preference model. Our results connect the new NLHF paradigm with traditional RL
theory, and validate the potential of reward-model-free learning under general
preferences.
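
One common way to write the KL-regularized objective described above is sketched below. The notation is an assumption and does not come from the abstract: \rho is the prompt distribution, \pi_0 the initial model, \mathcal{P}(y \succ y' \mid x) the probability that response y is preferred to y' given prompt x, and \tau > 0 the regularization strength.

```latex
\[
J(\pi, \pi') = \mathbb{E}_{x \sim \rho}\Big[
  \mathbb{E}_{y \sim \pi(\cdot \mid x),\, y' \sim \pi'(\cdot \mid x)}
    \big[\mathcal{P}(y \succ y' \mid x)\big]
  - \tau\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x)\big)
  + \tau\, \mathrm{KL}\big(\pi'(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x)\big)
\Big],
\qquad
\pi^{*} \in \arg\max_{\pi} \min_{\pi'} J(\pi, \pi').
\]
```

Under this formulation the max-player is rewarded for producing preferred responses and penalized for drifting from \pi_0, and the same penalty applies symmetrically to the opponent, so the Nash equilibrium is a policy that no competing policy near \pi_0 can beat more than half the time.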

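This paper studies reinforcement learning from human feedback, which learns from a probabilistic preference model, and examines a new learning paradigm, KL-regularized NLHF. The paradigm seeks a policy that stays close to the initial model while consistently generating responses preferred over those of competing policies. The paper connects it with traditional reinforcement learning theory and validates the potential of reward-model-free learning under general preferences.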