BriefGPT.xyz
Jun, 2019
Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza...
TL;DR
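The paper proposes a new batch deep reinforcement learning algorithm that learns effectively offline from a fixed batch of human interaction data, without online exploration, and reports significant improvements in areas such as open-domain dialog generation.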
Abstract
Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL t