Despite the promise of RLHF in aligning LLMs with human preferences, it often
leads to superficial alignment, prioritizing stylistic changes over improving
the downstream performance of LLMs. Underspecified preferences can obscure
the direction in which to align the models, and a lack of exploration restricts
the identification of desirable outputs for improving them. To overcome these challenges, we
propose a novel framework: Reinforcement Learning from Reflective Feedback
(RLRF), which leverages fine-grained feedback based on detailed criteria to
improve the core capabilities of LLMs. RLRF employs a self-reflection mechanism
to systematically explore and refine LLM responses, and then fine-tunes the models
via an RL algorithm using the promising responses. Our experiments on
Just-Eval, Factuality, and Mathematical Reasoning demonstrate the efficacy and
transformative potential of RLRF beyond surface-level adjustment.
