Reinforcement learning from human feedback (RLHF) has emerged as a powerful
technique to make large language models (LLMs) easier to prompt and more
capable in complex settings. At its core, RLHF provides a new toolkit for
optimizing LLMs with objectives other than next-token prediction, enabling the
integration of qualitative training goals. The attempted match between user preferences and
downstream performance, which happens in a learned reward model, results in an
optimization landscape where training and evaluation metrics can appear
correlated. The apparent correlation can lead to unexpected behaviors and
stories of "too much RLHF." In RLHF, challenges emerge because the following
sub-modules are not consistent with each other: the reward model training, the
policy model training, and the policy model evaluation. This mismatch results
in models that sometimes refuse user requests because of false safety flags, are
difficult to steer toward an intended characteristic, or always answer in a
specific style. As chat model evaluation becomes increasingly nuanced, the
reliance on a perceived link between reward model score and downstream
performance drives the objective mismatch issue. In this paper, we illustrate
the cause of this issue, review relevant literature from model-based
reinforcement learning, and discuss relevant solutions to encourage further
research. Solving objective mismatch in RLHF will allow the LLMs of the future
to be more precisely aligned to user instructions for both safety and helpfulness.

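Reinforcement learning from human feedback has become a powerful tool that makes large language models easier to steer and more capable in complex settings. However, inconsistencies among the reward model, the policy model, and evaluation give rise to an objective mismatch problem. This paper examines the causes of that problem and reviews the relevant literature on model learning and reinforcement learning. It also discusses solutions to objective mismatch in order to encourage further research, so that future language models follow user instructions more accurately and are safer and more helpful.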