We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the
pervasive issue of reward over-optimization in Reinforcement Learning from
Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization
occurs when a reward model serves as an imperfect proxy for human preference,
and RL-driven policy optimization erroneously exploits reward inaccuracies. In
this paper, we begin by introducing a lightweight way to quantify uncertainties
in rewards, relying solely on the last layer embeddings of the reward model,
without the need for computationally expensive reward ensembles. AdvPO then
addresses a distributionally robust optimization problem centred around the
confidence interval of the reward model's predictions for policy improvement.
Through comprehensive experiments on the Anthropic HH and TL;DR summarization
datasets, we illustrate the efficacy of AdvPO in mitigating the
over-optimization issue, which in turn yields improved performance under
human-assisted evaluation.
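
The two ingredients named above, an uncertainty estimate computed from last-layer embeddings and a policy objective that accounts for the reward's confidence interval, can be illustrated with a minimal sketch. It assumes a linear-bandit-style elliptical bound, `scale * sqrt(e^T M^{-1} e)`, where `M` is a ridge-regularized Gram matrix of training embeddings; this form, the helper names, and the `ridge` and `scale` parameters are assumptions for illustration, not details taken from the abstract. Subtracting the interval width from the reward is only a crude stand-in for the distributionally robust problem that AdvPO addresses.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def fit_precision(train_embeddings, ridge=1.0):
    """Return M^{-1} for M = ridge * I + sum_i e_i e_i^T (assumed form).

    The inverse is built with Sherman-Morrison rank-one updates, so no
    explicit matrix inversion is needed.
    """
    d = len(train_embeddings[0])
    minv = [[(1.0 / ridge) if i == j else 0.0 for j in range(d)]
            for i in range(d)]
    for e in train_embeddings:
        me = matvec(minv, e)  # M^{-1} e; M^{-1} stays symmetric
        denom = 1.0 + dot(e, me)
        minv = [[minv[i][j] - me[i] * me[j] / denom for j in range(d)]
                for i in range(d)]
    return minv

def reward_uncertainty(embedding, precision, scale=1.0):
    """Half-width of the assumed confidence interval around the reward."""
    return scale * math.sqrt(dot(embedding, matvec(precision, embedding)))

def pessimistic_reward(reward, embedding, precision, scale=1.0):
    """Lower end of the interval: a simple proxy for a robust objective."""
    return reward - reward_uncertainty(embedding, precision, scale)

# Hypothetical 2-d embeddings: direction [1, 0] is well covered by the
# training data, direction [0, 1] is not, so it receives a wider interval.
precision = fit_precision([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
seen = reward_uncertainty([1.0, 0.0], precision)
unseen = reward_uncertainty([0.0, 1.0], precision)
```

In this toy setup, a response whose embedding lies in a direction rarely seen during reward-model training is penalized more heavily, which is the intuition behind discouraging the policy from exploiting poorly supported reward predictions.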

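Adversarial Policy Optimization (AdvPO) is introduced as a new approach to reward over-optimization in reinforcement learning from human feedback. It quantifies the uncertainty of the reward model and applies distributionally robust optimization over the reward model's confidence interval, thereby improving performance.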

Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

A policy represented by a deep neural network can overfit to spurious
features in observations, which hampers a reinforcement learning agent in
learning an effective policy. This issue becomes severe in high-dimensional state
spaces, where the agent struggles to learn a useful policy. Data augmentation can
provide a performance boost to RL agents by mitigating the effect of
overfitting. However, such data augmentation is a form of prior knowledge, and
naively applying it to an environment might worsen an agent's performance. In
this paper, we propose a novel RL algorithm to mitigate the above issue and
improve the efficiency of the learned policy. Our approach consists of a
max-min game theoretic objective where a perturber network modifies the state
to maximize the agent's probability of taking a different action while
minimizing the distortion in the state. In contrast, the policy network updates
its parameters to minimize the effect of perturbation while maximizing the
expected future reward. Based on this objective, we propose a practical deep
reinforcement learning algorithm, Adversarial Policy Optimization (APO). Our
method is agnostic to the type of policy optimization, and thus data
augmentation can be incorporated to harness its benefit. We evaluate our
approach on several DeepMind Control robotic environments with
high-dimensional and noisy state settings. Empirical results demonstrate that
our method APO consistently outperforms the state-of-the-art on-policy PPO
agent. We further compare our method with the state-of-the-art data augmentation
method RAD and the regularization-based approach DRAC. Our agent APO performs
better than these baselines.
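
The max-min objective described above can be sketched with a toy example. The sketch assumes a linear-softmax policy, measures "taking a different action" as the KL divergence between the action distributions on the original and perturbed states, and measures distortion as squared Euclidean distance. These choices, along with the names `distortion_weight`, `consistency_weight`, and `rl_loss`, are illustrative assumptions rather than the paper's definitions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def policy_probs(weights, state):
    """Toy linear-softmax policy; weights holds one row per action."""
    return softmax([sum(w * s for w, s in zip(row, state))
                    for row in weights])

def action_shift(weights, state, perturbed_state):
    """How much the perturbation changes the action distribution."""
    return kl_divergence(policy_probs(weights, state),
                         policy_probs(weights, perturbed_state))

def perturber_objective(weights, state, perturbed_state,
                        distortion_weight=1.0):
    """Maximized by the perturber: action shift minus state distortion."""
    distortion = sum((a - b) ** 2 for a, b in zip(state, perturbed_state))
    return (action_shift(weights, state, perturbed_state)
            - distortion_weight * distortion)

def policy_loss(weights, state, perturbed_state, rl_loss,
                consistency_weight=1.0):
    """Minimized by the policy: its usual RL loss plus the action shift."""
    return rl_loss + consistency_weight * action_shift(
        weights, state, perturbed_state)

# Hypothetical two-action policy and a large perturbation of the state.
weights = [[1.0, 0.0], [0.0, 1.0]]
state, perturbed = [1.0, 0.0], [0.0, 1.0]
gain = perturber_objective(weights, state, perturbed)
loss = policy_loss(weights, state, perturbed, rl_loss=0.5)
```

Because the policy loss adds a consistency term to whatever RL loss is supplied, a sketch of this shape can wrap any policy-optimization loss, which matches the abstract's claim that the method is agnostic to the type of policy optimization.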

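This paper proposes a new reinforcement learning algorithm, APO, which uses a max-min game-theoretic objective to mitigate overfitting and improve the efficiency of the learned policy, and evaluates it on several DeepMind Control robotic environments with high-dimensional and noisy state settings. Empirical results show that APO consistently outperforms the state-of-the-art on-policy PPO agent, and it is further compared with the state-of-the-art data augmentation method RAD and the regularization-based approach DRAC.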