The development of Large Language Models (LLMs) often confronts challenges
stemming from the heavy reliance on human annotators in the reinforcement
learning with human feedback (RLHF) framework, or the frequent and costly
external queries tied to the self-instruct paradigm. In this work, we pivot to
Reinforcement Learning (RL) -- but with a twist. Diverging from the typical
RLHF, which refines LLMs following instruction data training, we use RL to
directly generate the foundational instruction dataset that alone suffices for
fine-tuning. Our method, TeaMs-RL, uses a suite of textual operations and
rules, prioritizing the diversification of training datasets. It facilitates
the generation of high-quality data without excessive reliance on external
advanced models, paving the way for a single fine-tuning step and negating the
need for subsequent RLHF stages. Our findings highlight key advantages of our
approach: reduced need for human involvement and fewer model queries (only
$5.73\%$ of WizardLM's total), along with enhanced capabilities of LLMs in
crafting and comprehending complex instructions compared to strong baselines,
and substantially improved model privacy protection.

通过使用增强学习直接生成基础指令数据集，TeaMs-RL 方法能够在单一微调步骤中提高大型语言模型的能力，减少人为参与需求、模型查询次数以及提高模型隐私保护能力。