Sep 2023

Efficient RLHF: Reducing the Memory Usage of PPO

TL;DR: Reinforcement Learning with Human Feedback (RLHF) revolutionized language modeling by aligning models with human preferences, but its RL stage, Proximal Policy Optimization (PPO), is memory-hungry. This paper presents an analysis of the memory usage, performance, and training time of memory-saving techniques for PPO. The proposed Hydra-RLHF integrates the Supervised Fine-Tuning (SFT) and reward models into a single backbone and dynamically turns LoRA 'off' during training, reducing memory usage while maintaining or improving alignment across benchmarks. The resulting Hydra-PPO is a simple and promising step toward making RLHF more widely usable.
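To make the LoRA-toggling idea concrete, here is a minimal sketch assuming the HuggingFace PEFT library; the model name, LoRA hyperparameters, and use of `disable_adapter()` are illustrative assumptions, not the paper's actual implementation. The key point: the same LoRA-wrapped backbone produces the trainable policy output when the adapters are on, and acts as the frozen reference model when they are switched off, so PPO need not keep a second full copy of the model in memory.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative sketch: one LoRA-wrapped backbone plays two roles in PPO.
# (Model name and hyperparameters are assumptions, chosen for brevity.)
base = AutoModelForCausalLM.from_pretrained("gpt2")
lora_cfg = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)

input_ids = torch.tensor([[464, 3303, 2746]])  # dummy token ids

# Policy pass: LoRA adapters active, gradients flow into the adapters only.
policy_logits = model(input_ids).logits

# Reference pass: LoRA turned "off", so the very same weights behave as
# the frozen SFT model -- no separate reference copy is held in memory.
with torch.no_grad(), model.disable_adapter():
    ref_logits = model(input_ids).logits

# PPO's KL penalty can then be computed between policy_logits and ref_logits.
```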