During the last stage of RLHF, a large language model is aligned to human intent via PPO training, a process that generally requires large-scale computational resources. In this technical report, we empirically investigate an efficient implementation of RLHF using low-rank adaptation (LoRA), which allows us to align the LLaMA 7B checkpoint on the Alpaca dataset using only two A100 GPUs instead of the eight required for full model fine-tuning. Despite tuning only 0.2% of LLaMA 7B's parameters, our implementation achieves better performance than the publicly released AlpacaFarm checkpoint trained with full model fine-tuning. Next, we analyze several configurations of our LoRA-based PPO implementation, varying the form of the KL regularization term in the training objective. We find that (1) removing this penalty term does not harm performance on the AlpacaFarm evaluation set under our LoRA setup; (2) other regularizers, such as Jensen-Shannon divergence, lead to improved performance; and (3) while PPO training negatively impacts the factuality of model-generated responses, training with LoRA largely mitigates this effect. We release our code and pretrained checkpoints to facilitate future research on more efficient RLHF.
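To make the regularization variants concrete, the sketch below compares a KL penalty, a Jensen-Shannon penalty, and no penalty on a toy next-token distribution. It is an illustration only, not the released implementation: the function names, the coefficient `beta`, and the use of full probability vectors are all assumptions made here. Practical PPO code usually estimates the penalty per token from log-probabilities of sampled tokens rather than from full distributions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def penalized_reward(reward, policy_probs, ref_probs, beta=0.1, regularizer="kl"):
    """Hypothetical helper: subtract a divergence penalty from a scalar reward.

    regularizer: "kl", "js", or "none" (the no-penalty ablation).
    beta is an assumed penalty coefficient, not a value from the report.
    """
    if regularizer == "none":
        return reward
    if regularizer == "kl":
        penalty = kl_divergence(policy_probs, ref_probs)
    elif regularizer == "js":
        penalty = js_divergence(policy_probs, ref_probs)
    else:
        raise ValueError(f"unknown regularizer: {regularizer}")
    return reward - beta * penalty


# Toy distributions over a 3-token vocabulary (made-up numbers).
policy = [0.7, 0.2, 0.1]
reference = [0.4, 0.4, 0.2]
for name in ("kl", "js", "none"):
    print(name, penalized_reward(1.0, policy, reference, regularizer=name))
```

Unlike KL, the Jensen-Shannon divergence is symmetric in its arguments and bounded above by ln 2, so the penalty it contributes cannot grow without limit as the policy drifts from the reference model.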

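By applying low-rank adaptation (LoRA) to RLHF, this work aligns the LLaMA 7B checkpoint on the Alpaca dataset using only two A100 GPUs and, while tuning only 0.2% of the parameters, outperforms the publicly released AlpacaFarm checkpoint trained with full model fine-tuning. We also find that Jensen-Shannon divergence works better as a regularization term, and that training with LoRA reduces, to some extent, the negative impact of PPO training on the factuality of model-generated responses.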
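The small trainable fraction follows from how LoRA parameterizes an update: instead of training a full `d_out x d_in` weight matrix, it trains two low-rank factors of shapes `d_out x r` and `r x d_in`. The sketch below does that arithmetic for a single matrix. The dimensions and rank are hypothetical and are not the report's configuration; the model-wide fraction also depends on which matrices receive adapters, which the text above does not specify.

```python
def lora_param_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters for one weight matrix, as a fraction of
    the full matrix size. Hypothetical helper for illustration."""
    full_params = d_in * d_out           # parameters in the frozen full matrix
    lora_params = rank * (d_in + d_out)  # factors B (d_out x r) and A (r x d_in)
    return lora_params / full_params


# Assumed example: a square 4096 x 4096 projection with rank-8 adapters.
fraction = lora_param_fraction(d_in=4096, d_out=4096, rank=8)
print(f"{fraction:.4%} of this matrix's parameters are trainable")
```

Because only some of a model's matrices are typically adapted, the fraction over the whole model is smaller than the per-matrix figure computed here.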

Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF