Reinforcement Learning from AI Feedback (RLAIF) offers shorter annotation
cycles and lower costs than Reinforcement Learning from Human Feedback (RLHF),
making it highly efficient during the rapid strategy iteration periods of large
language model (LLM) training. Using ChatGPT as a labeler to
provide feedback on open-domain prompts in RLAIF training, we observe an
increase in human evaluators' preference win ratio for model responses, but a
decrease in evaluators' satisfaction rate. Analysis suggests that the decrease
in satisfaction rate is mainly due to some responses becoming less helpful,
particularly in terms of correctness and truthfulness, highlighting practical
limitations of basic RLAIF. In this paper, we propose Hybrid Reinforcement
Learning from AI Feedback (HRLAIF). This method enhances the accuracy of AI
annotations for responses, making the model's helpfulness more robust during
training. Additionally, it employs AI for Red Teaming, further
improving the model's harmlessness. Human evaluation results show that HRLAIF
inherits the ability of RLAIF to enhance human preference for outcomes at a low
cost while also improving the satisfaction rate of responses. Compared to the
policy model before Reinforcement Learning (RL), HRLAIF increases the
satisfaction rate by 2.08\%, whereas basic RLAIF decreases it by 4.58\%.
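
The observation that preference win ratio can rise while satisfaction rate
falls may be easier to see with a toy calculation. The sketch below is only an
illustration: the records are invented, and the metric definitions (win ratio
as the share of pairwise comparisons in which the post-RL response is
preferred, satisfaction rate as the share of responses judged satisfactory) are
assumptions, not the evaluation protocol used in the paper.

```python
from typing import List, Tuple

# Each hypothetical record: (before_ok, after_ok, preferred)
#   before_ok: evaluator judged the pre-RL response satisfactory
#   after_ok:  evaluator judged the post-RL response satisfactory
#   preferred: which response won the pairwise comparison ("before" or "after")
Record = Tuple[bool, bool, str]

# Invented toy data, not results from the paper.
records: List[Record] = (
    [(True, True, "after")] * 4      # both fine, post-RL response preferred
    + [(False, False, "after")] * 2  # preferred, yet still unsatisfactory
    + [(False, True, "after")] * 1   # genuinely improved
    + [(True, False, "before")] * 3  # post-RL response became incorrect
)


def win_ratio(rows: List[Record]) -> float:
    """Share of comparisons in which the post-RL response is preferred."""
    return sum(1 for _, _, pref in rows if pref == "after") / len(rows)


def satisfaction_rate(flags: List[bool]) -> float:
    """Share of responses judged satisfactory."""
    return sum(flags) / len(flags)


before_rate = satisfaction_rate([before for before, _, _ in records])
after_rate = satisfaction_rate([after for _, after, _ in records])
post_rl_win_ratio = win_ratio(records)

print(f"win ratio of post-RL responses: {post_rl_win_ratio:.2f}")
print(f"satisfaction rate before RL:    {before_rate:.2f}")
print(f"satisfaction rate after RL:     {after_rate:.2f}")
```

In this toy data the post-RL responses win most pairwise comparisons, yet fewer
of them are satisfactory in absolute terms, because some preferred responses
are still unsatisfactory and some previously correct responses become
incorrect.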

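Reinforcement Learning from AI Feedback (RLAIF) has shorter annotation cycles
and lower costs than Reinforcement Learning from Human Feedback (RLHF), which
makes it more efficient during the rapid strategy iteration phases of large
language model (LLM) training. This paper proposes Hybrid Reinforcement
Learning from AI Feedback (HRLAIF), which improves the accuracy of AI
annotations so that the model's helpfulness is more robust during training, and
uses AI for Red Teaming to further improve the model's harmlessness. Compared
to the policy model before RL, HRLAIF increases the satisfaction rate by
2.08\%, effectively addressing the 4.58\% decrease in satisfaction rate after
basic RLAIF.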