Reinforcement learning from human feedback (RLHF) can improve the quality of large language models' (LLMs) outputs by aligning them with human preferences. We propose a simple algorithm for aligning LLMs with human preferences, inspired by growing batch reinforcement learning (RL): Reflection-Reinforced Self-Training (Re-ReST), which leverages a reflection model to refine low-quality samples and augment self-training, efficiently improving sample quality.
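The refine-and-augment loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate`, `score`, and `reflect` are hypothetical placeholders standing in for the base model, a quality verifier, and the reflection model, and the threshold-based acceptance rule is an assumption.

```python
from typing import Callable, List, Tuple

def re_rest_round(
    tasks: List[str],
    generate: Callable[[str], str],      # base model: task -> candidate answer
    score: Callable[[str, str], float],  # verifier: (task, answer) -> quality score
    reflect: Callable[[str, str], str],  # reflection model: refine a weak answer
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """One self-training round: keep high-scoring samples directly; route
    low-scoring samples through the reflection model and keep the refined
    version only if it now clears the quality threshold."""
    training_pairs: List[Tuple[str, str]] = []
    for task in tasks:
        answer = generate(task)
        if score(task, answer) >= threshold:
            training_pairs.append((task, answer))       # accepted as-is
        else:
            refined = reflect(task, answer)             # reflection refines it
            if score(task, refined) >= threshold:
                training_pairs.append((task, refined))  # salvaged sample
    return training_pairs
```

The collected pairs would then form the fine-tuning set for the next self-training iteration; by salvaging samples that a plain self-training loop would discard, the reflection step augments the training data without extra sampling from the base model.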