Traditional reinforcement learning from human feedback (RLHF) approaches
relying on parametric models like the Bradley-Terry model fall short in
capturing the intransitivity and irrationality in human preferences. Recent
advancements suggest that directly working with preference probabilities can
yield a more accurate reflection of human preferences, enabling more flexible
and accurate language model alignment. In this paper, we propose a
self-play-based method for language model alignment, which treats the problem
as a constant-sum two-player game aimed at identifying the Nash equilibrium
policy. Our approach, dubbed \textit{Self-Play Preference Optimization} (SPPO),
approximates the Nash equilibrium through iterative policy updates and enjoys
a theoretical convergence guarantee. Our method can effectively increase the
log-likelihood of the chosen response and decrease that of the rejected
response, which cannot be trivially achieved by symmetric pairwise losses such as
Direct Preference Optimization (DPO) and Identity Preference Optimization
(IPO). In our experiments, using only 60k prompts (without responses) from the
UltraFeedback dataset and without any prompt augmentation, by leveraging a
pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain
a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the
state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on
AlpacaEval 2.0. It also outperforms (iterative) DPO and IPO on MT-Bench and
the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved
without additional external supervision (e.g., responses, preferences, etc.)
from GPT-4 or other stronger language models.
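
To make the game-theoretic framing concrete, the following toy sketch runs a multiplicative-weights self-play update on a hand-written preference matrix over three candidate responses. It is an illustrative tabular analogue under stated assumptions, not the SPPO training procedure: the matrix `P`, the step size `eta`, and all function names are hypothetical, and no language model is involved. The preferences are cyclic (intransitive), so no Bradley-Terry scoring can reproduce them, yet the time-averaged policy still approaches the Nash equilibrium of the constant-sum game.

```python
import math

# Hypothetical preference matrix: P[i][j] = probability that response i is
# preferred over response j, with P[i][j] + P[j][i] = 1 (constant-sum game).
# The cycle "1 beats 0, 0 beats 2, 2 beats 1" is intransitive.
P = [
    [0.5, 0.1, 0.9],
    [0.9, 0.5, 0.1],
    [0.1, 0.9, 0.5],
]

def win_prob_vs_policy(P, policy):
    """Probability that each response beats a sample from `policy`."""
    return [sum(p_ij * q for p_ij, q in zip(row, policy)) for row in P]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def self_play(P, init_policy, eta=0.05, steps=20000):
    """Multiplicative-weights self-play; returns the time-averaged policy."""
    logits = [math.log(p) for p in init_policy]
    avg = [0.0] * len(P)
    for _ in range(steps):
        policy = softmax(logits)
        avg = [a + p / steps for a, p in zip(avg, policy)]
        # The policy plays against itself: upweight responses that beat it.
        gains = win_prob_vs_policy(P, policy)
        logits = [l + eta * g for l, g in zip(logits, gains)]
    return avg

def exploitability(P, policy):
    """Margin by which the best single response beats `policy` above 1/2."""
    return max(win_prob_vs_policy(P, policy)) - 0.5

avg_policy = self_play(P, [0.8, 0.1, 0.1])
print(avg_policy, exploitability(P, avg_policy))
```

For this matrix the equilibrium is the uniform policy, so the printed average should be close to one third per response and the exploitability close to zero. The sketch averages iterates because individual multiplicative-weights iterates can keep cycling around the equilibrium.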

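This paper proposes a self-play-based method for language model alignment, called SPPO, which approximates the Nash equilibrium policy through iterative policy updates. It can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, and it outperforms methods based on symmetric pairwise losses in multiple experiments.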

Self-Play Preference Optimization for Language Model Alignment

We study data corruption robustness in offline two-player zero-sum Markov
games. Given a dataset of realized trajectories of two players, an adversary is
allowed to modify an $\epsilon$-fraction of it. The learner's goal is to
identify an approximate Nash Equilibrium policy pair from the corrupted data.
We consider this problem in linear Markov games under different degrees of data
coverage and corruption. We start by providing an information-theoretic lower
bound on the suboptimality gap of any learner. Next, we propose robust versions
of the Pessimistic Minimax Value Iteration algorithm, both under coverage on
the corrupted data and under coverage only on the clean data, and show that
they achieve (near-)optimal suboptimality gap bounds with respect to
$\epsilon$. We note that we are the first to provide such a characterization of
the problem of learning approximate Nash Equilibrium policies in offline
two-player zero-sum Markov games under data corruption.
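
As a concrete reference point for the suboptimality gap mentioned above, the sketch below computes the duality gap of a policy pair in a single-state zero-sum game, that is, a payoff matrix. This is a simplified stand-in chosen for illustration: in a Markov game the analogous quantity would be defined through value functions, and the names `nash_gap` and `payoff` as well as the example matrix are hypothetical.

```python
def nash_gap(payoff, mu, nu):
    """Duality gap of the pair (mu, nu) in a zero-sum matrix game.

    payoff[i][j] is the payoff to the row (maximizing) player. The gap is the
    row player's best-response value against nu minus the worst-case value of
    mu. It is zero at a Nash equilibrium and positive otherwise.
    """
    rows, cols = len(mu), len(nu)
    best_row = max(sum(payoff[i][j] * nu[j] for j in range(cols))
                   for i in range(rows))
    worst_col = min(sum(mu[i] * payoff[i][j] for i in range(rows))
                    for j in range(cols))
    return best_row - worst_col

# Matching pennies: the unique Nash equilibrium is uniform play by both players.
payoff = [[1.0, -1.0], [-1.0, 1.0]]
print(nash_gap(payoff, [0.5, 0.5], [0.5, 0.5]))  # equilibrium pair
print(nash_gap(payoff, [0.9, 0.1], [0.5, 0.5]))  # exploitable row policy
```

An approximate Nash equilibrium pair is one whose gap is small; the bounds discussed in the abstract describe how this kind of gap must scale with the corruption level $\epsilon$.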

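We study data corruption robustness in offline two-player zero-sum linear Markov games, propose robust versions of the Pessimistic Minimax Value Iteration algorithm, and give (near-)optimal suboptimality gap bounds with respect to $\epsilon$.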

Corruption-Robust Offline Two-Player Zero-Sum Markov Games

We explore the problem of imitation learning (IL) in the context of
mean-field games (MFGs), where the goal is to imitate the behavior of a
population of agents following a Nash equilibrium policy according to some
unknown payoff function. IL in MFGs presents new challenges compared to
single-agent IL, particularly when both the reward function and the transition
kernel depend on the population distribution. In this paper, departing from the
existing literature on IL for MFGs, we introduce a new solution concept called
the Nash imitation gap. Then we show that when only the reward depends on the
population distribution, IL in MFGs can be reduced to single-agent IL with
similar guarantees. However, when the dynamics are population-dependent, we
provide a novel upper bound that suggests IL is harder in this setting. To
address this issue, we propose a new adversarial formulation where the
reinforcement learning problem is replaced by a mean-field control (MFC)
problem, suggesting progress in IL within MFGs may have to build upon MFC.

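This paper studies imitation learning in mean-field games and introduces the Nash imitation gap as a new solution concept. It shows that when only the reward depends on the population distribution, the problem reduces to single-agent imitation learning, and it gives a new upper bound for the case where the dynamics are population-dependent.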

On Imitation in Mean-field Games

Modern reinforcement learning (RL) commonly engages practical problems with
large state spaces, where function approximation must be deployed to
approximate either the value function or the policy. While recent progress in
RL theory addresses a rich set of RL problems with general function
approximation, such successes are mostly restricted to the single-agent
setting. It remains elusive how to extend these results to multi-agent RL,
especially due to the new challenges arising from its game-theoretical nature.
This paper considers two-player zero-sum Markov Games (MGs). We propose a new
algorithm that can provably find the Nash equilibrium policy using a polynomial
number of samples, for any MG with low multi-agent Bellman-Eluder dimension --
a new complexity measure adapted from its single-agent version (Jin et al.,
2021). A key component of our new algorithm is the exploiter, which facilitates
the learning of the main player by deliberately exploiting her weakness. Our
theoretical framework is generic and applies to a wide range of models
including but not limited to tabular MGs, MGs with linear or kernel function
approximation, and MGs with rich observations.
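
The role of the exploiter can be illustrated with a toy matrix-game analogue. In the sketch below, the exploiter is modeled as an exact best response to the main player's current policy, and the main player takes an exponential-weights step against it. Both modeling choices are assumptions made for illustration, as are the payoff matrix `A` and all function names; the paper's algorithm targets Markov games with function approximation and is not reproduced here.

```python
import math

# Hypothetical payoff matrix for the main (row, maximizing) player:
# rock-paper-scissors, whose game value is 0.
A = [
    [0.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
    [-1.0, 1.0, 0.0],
]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def column_values(A, policy):
    """Main player's expected payoff against each pure column action."""
    return [sum(policy[i] * A[i][j] for i in range(len(A)))
            for j in range(len(A[0]))]

def train_main_player(A, eta=0.02, steps=50000):
    """Returns the main player's time-averaged policy."""
    logits = [0.0] * len(A)  # main player starts from the uniform policy
    avg = [0.0] * len(A)
    for _ in range(steps):
        policy = softmax(logits)
        avg = [a + p / steps for a, p in zip(avg, policy)]
        # Exploiter: pick the column that hurts the current main policy most.
        values = column_values(A, policy)
        j = min(range(len(values)), key=values.__getitem__)
        # Main player: exponential-weights step against the exploiter's action.
        logits = [l + eta * A[i][j] for i, l in enumerate(logits)]
    return avg

avg_policy = train_main_player(A)
worst_case = min(column_values(A, avg_policy))
print(avg_policy, worst_case)
```

Because the exploiter always attacks the main player's current weakness, the averaged main policy ends up with a worst-case payoff close to the game value, which is the defining property of a Nash equilibrium strategy in a zero-sum game.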

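This paper proposes a new algorithm for multi-agent reinforcement learning in problems with large state spaces. It finds the Nash equilibrium policy of two-player zero-sum Markov games that have a low multi-agent Bellman-Eluder dimension.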