Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant
potential across various domains, including mitigating harm in LLM outputs,
enhancing text summarization, and improving mathematical reasoning. This paper introduces
an RLAIF framework for improving the code generation abilities of lightweight
(<1B parameters) LLMs. We specifically focus on code generation tasks that
require writing appropriate API calls, which is challenging due to the
well-known issue of hallucination in LLMs. Our framework extracts AI feedback
from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and
uses this data to train a reward model for better alignment of smaller
LLMs. We run our experiments on the Gorilla dataset and meticulously assess the
quality of the model-generated code across various metrics, including AST,
ROUGE, and Code-BLEU, and develop a pipeline to compute its executability rate
accurately. Our approach significantly enhances the fine-tuned LLM baseline's
performance, achieving a 4.5% improvement in executability rate. Notably, a
smaller LLM (780M parameters) trained with RLAIF surpasses a much larger
fine-tuned baseline with 7B parameters, achieving a 1.0% higher code
executability rate.
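
The executability rate mentioned above can be illustrated with a minimal sketch. The abstract does not describe the authors' pipeline, so everything here is an assumption: the names `is_executable` and `executability_rate` are hypothetical, each snippet is assumed to be a standalone Python script, and "executable" is taken to mean "exits with status 0 within a timeout". A real pipeline for API-calling code would also need the relevant libraries installed and some form of sandboxing.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def is_executable(snippet: str, timeout_s: float = 10.0) -> bool:
    """Return True if the snippet runs to completion with exit code 0.

    Assumption: the snippet is a self-contained Python script. It is run in
    a separate interpreter process so that a crash or hang in generated code
    cannot take down the evaluation loop.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "snippet.py"
        script.write_text(snippet, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat snippets that hang as non-executable.
            return False
        return result.returncode == 0


def executability_rate(snippets: Sequence[str]) -> float:
    """Fraction of snippets that execute successfully (0.0 for empty input)."""
    if not snippets:
        return 0.0
    return sum(is_executable(s) for s in snippets) / len(snippets)
```

Under these assumptions, a reported improvement in executability rate corresponds to the difference between `executability_rate` values computed on the outputs of two models for the same set of prompts.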

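Reinforcement Learning from AI Feedback (RLAIF) has shown great potential in several domains, including reducing harm in LLM outputs, improving text summarization, and mathematical reasoning. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (<1B parameters) LLMs, focusing on code generation tasks that require writing appropriate API calls. Through a specialized prompting strategy, it extracts AI feedback data from a larger LLM (e.g., GPT-3.5) and uses it to train a reward model for better alignment of smaller LLMs. We run experiments on the Gorilla dataset, evaluate the quality of the model-generated code with several metrics including AST, ROUGE, and Code-BLEU, and develop a pipeline that accurately computes its executability rate. Our approach significantly improves on the fine-tuned LLM baseline, raising the executability rate by 4.5%. Notably, a smaller LLM (780M parameters) trained with RLAIF surpasses a much larger fine-tuned baseline with 7B parameters, with a 1.0% higher code executability rate.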

Applying RLAIF for Code Generation with API-usage in Lightweight LLMs

Reinforcement Learning from AI Feedback (RLAIF) has the advantages of shorter
annotation cycles and lower costs over Reinforcement Learning from Human
Feedback (RLHF), making it highly efficient during the rapid strategy iteration
periods of large language model (LLM) training. Using ChatGPT as a labeler to
provide feedback on open-domain prompts in RLAIF training, we observe an
increase in human evaluators' preference win ratio for model responses, but a
decrease in evaluators' satisfaction rate. Analysis suggests that the decrease
in satisfaction rate is mainly due to some responses becoming less helpful,
particularly in terms of correctness and truthfulness, highlighting practical
limitations of basic RLAIF. In this paper, we propose Hybrid Reinforcement
Learning from AI Feedback (HRLAIF). This method enhances the accuracy of AI
annotations for responses, making the model's helpfulness more robust during
training. Additionally, it employs AI for red teaming, further
improving the model's harmlessness. Human evaluation results show that HRLAIF
inherits the ability of RLAIF to enhance human preference for outcomes at a low
cost while also improving the satisfaction rate of responses. Compared to the
policy model before Reinforcement Learning (RL), it achieves an increase of
2.08% in satisfaction rate, effectively addressing the 4.58% decrease in
satisfaction rate observed after basic RLAIF.

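Compared with Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF) offers shorter annotation cycles and lower costs, making it more efficient during the rapid strategy iteration periods of large language model (LLM) training. This paper proposes Hybrid Reinforcement Learning from AI Feedback (HRLAIF), which improves the accuracy of AI annotations so that the model's helpfulness is more robust during training, and uses AI for red teaming to further improve the model's harmlessness. Compared with the policy model before RL, HRLAIF raises the satisfaction rate by 2.08%, effectively addressing the 4.58% drop in satisfaction rate after basic RLAIF.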

HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback

This paper proposes an interpretation of RLAIF as Bayesian inference by
introducing distilled Self-Critique (dSC), which refines the outputs of an LLM
through a Gibbs sampler that is later distilled into a fine-tuned model. Only
requiring synthetic data, dSC is exercised in experiments regarding safety,
sentiment, and privacy control, showing it can be a viable and cheap
alternative for aligning LLMs. Code is released at
https://github.com/vicgalle/distilled-self-critique.
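
The refine-then-distill idea can be sketched roughly as follows. This is not the authors' implementation (see their repository for that); it is a toy illustration that assumes the Gibbs sampler alternates between sampling a critique given the current response and sampling a revised response given that critique. The names `gibbs_refine`, `build_distillation_set`, `toy_critique`, and `toy_revision` are hypothetical, and the toy functions are deterministic stand-ins for what would be stochastic LLM calls.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical signatures for the two conditional samplers.
# A real system would back both of these with calls to an LLM.
CritiqueSampler = Callable[[str, str, random.Random], str]
RevisionSampler = Callable[[str, str, str, random.Random], str]


def gibbs_refine(
    prompt: str,
    initial_response: str,
    sample_critique: CritiqueSampler,
    sample_revision: RevisionSampler,
    n_steps: int,
    rng: random.Random,
) -> str:
    """Alternate critique and revision sampling for n_steps sweeps."""
    response = initial_response
    for _ in range(n_steps):
        # Sample a critique conditioned on the prompt and current response.
        critique = sample_critique(prompt, response, rng)
        # Sample a new response conditioned on the prompt and that critique.
        response = sample_revision(prompt, response, critique, rng)
    return response


def build_distillation_set(
    prompts: Sequence[str],
    draft: Callable[[str], str],
    sample_critique: CritiqueSampler,
    sample_revision: RevisionSampler,
    n_steps: int = 2,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Collect (prompt, refined response) pairs as synthetic fine-tuning data."""
    rng = random.Random(seed)
    return [
        (p, gibbs_refine(p, draft(p), sample_critique, sample_revision, n_steps, rng))
        for p in prompts
    ]


# Toy stand-ins (assumptions for illustration only, not LLM calls).
def toy_critique(prompt: str, response: str, rng: random.Random) -> str:
    return "contains 'darn'" if "darn" in response else "no issues"


def toy_revision(prompt: str, response: str, critique: str, rng: random.Random) -> str:
    if critique == "no issues":
        return response
    return response.replace("darn ", "")
```

In this sketch the distillation step itself is left out: the pairs returned by `build_distillation_set` would serve as the synthetic data on which a model is fine-tuned, so that it produces the refined response directly without running the sampler at inference time.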

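This paper interprets RLAIF as Bayesian inference by introducing distilled Self-Critique (dSC), which refines the outputs of an LLM with a Gibbs sampler and distills the result into a fine-tuned model. Requiring only synthetic data, dSC is shown in experiments on safety, sentiment, and privacy control to be a viable and cheap alternative for aligning LLMs. Code is available at https://github.com/vicgalle/distilled-self-critique.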