A State-Augmentation-Based Method for Reinforcement Learning from Human Preferences

Feb, 2023

A State Augmentation based approach to Reinforcement Learning from Human Preferences

Mudit Verma, Subbarao Kambhampati

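TL;DR: This paper proposes a state augmentation technique for preference-based reinforcement learning, where binary human feedback on the agent's behavior is used to learn a reward model that better supports reinforcement learning. The method is validated on three task domains: Mountain Car, Quadruped-Walk, and Sweep-Into.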

Abstract

Reinforcement learning has suffered from poor reward specification and from reward hacking, even in simple enough domains. Preference-based reinforcement learning attempts to solve the issue by utilizing bi