The development of Large Language Models (LLMs) often confronts challenges
stemming from the heavy reliance on human annotators in the reinforcement
learning from human feedback (RLHF) framework, or the frequent and costly
external queries tied to the self-instruct paradigm. In this work, we pivot to
Reinforcement Learning (RL) -- but with a twist. Diverging from the typical
RLHF, which refines LLMs following instruction data training, we use RL to
directly generate the foundational instruction dataset that alone suffices for
fine-tuning. Our method, TeaMs-RL, uses a suite of textual operations and
rules, prioritizing the diversification of training datasets. It facilitates
the generation of high-quality data without excessive reliance on external
advanced models, paving the way for a single fine-tuning step and negating the
need for subsequent RLHF stages. Our findings highlight key advantages of our
approach: reduced need for human involvement and fewer model queries (only
$5.73\%$ of WizardLM's total), along with enhanced capabilities of LLMs in
crafting and comprehending complex instructions compared to strong baselines,
and substantially improved model privacy protection.
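
As a rough illustration of the idea of letting a learned selector choose textual operations that diversify an instruction set, here is a minimal sketch. It is not the TeaMs-RL policy: it swaps in a simple epsilon-greedy bandit, and the operations (`add_constraint`, `ask_for_reasoning`, `no_op`), the `diversity_reward` function, and `build_dataset` are all hypothetical names invented for this example. In a real pipeline each operation would presumably be applied by querying a language model rather than by string concatenation.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical textual operations (assumptions, not taken from the paper).
OPERATIONS: Dict[str, Callable[[str], str]] = {
    "add_constraint": lambda s: s + " Use at most 50 words.",
    "ask_for_reasoning": lambda s: s + " Explain each step.",
    "no_op": lambda s: s,
}


def diversity_reward(candidate: str, dataset: List[str]) -> float:
    """Fraction of the candidate's words not yet seen anywhere in the dataset."""
    seen = {w for text in dataset for w in text.lower().split()}
    words = candidate.lower().split()
    return sum(w not in seen for w in words) / len(words) if words else 0.0


def build_dataset(
    seeds: List[str], steps: int = 30, epsilon: float = 0.2, seed: int = 0
) -> Tuple[List[str], Dict[str, float]]:
    """Grow a dataset from seed instructions with an epsilon-greedy selector."""
    rng = random.Random(seed)
    dataset = list(seeds)
    value = {name: 0.0 for name in OPERATIONS}  # running mean reward per operation
    count = {name: 0 for name in OPERATIONS}
    for _ in range(steps):
        if rng.random() < epsilon:
            name = rng.choice(list(OPERATIONS))  # explore
        else:
            name = max(value, key=value.get)  # exploit the best estimate so far
        candidate = OPERATIONS[name](rng.choice(seeds))
        reward = diversity_reward(candidate, dataset)
        count[name] += 1
        value[name] += (reward - value[name]) / count[name]
        if reward > 0 and candidate not in dataset:
            dataset.append(candidate)  # keep only instructions that add new words
    return dataset, value


if __name__ == "__main__":
    data, estimates = build_dataset(["Sort a list of numbers.", "Summarize this article."])
    print(len(data), estimates)
```

The reward here only counts unseen words, which is a crude proxy for diversity; it is meant to show the select, apply, score, update loop rather than any particular reward design.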

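By using reinforcement learning to directly generate the foundational instruction dataset, TeaMs-RL improves the capabilities of large language models in a single fine-tuning step while reducing human involvement, lowering the number of model queries, and strengthening model privacy protection.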

TeaMs-RL: Teaching LLMs to Teach Themselves Better Instructions via Reinforcement Learning

Federated learning (FL) enables multiple participants to collaboratively
train machine learning models using decentralized data sources, alleviating
privacy concerns that arise from directly sharing local data. However, the lack
of model privacy protection in FL becomes a non-negligible challenge,
especially when participants want to fine-tune a proprietary large language
model in a federated manner. In this study, we propose a novel FL training approach
that accomplishes information exchange among participants via tunable soft
prompts. These soft prompts, updated and transmitted between the server and
clients, assume the role of the global model parameters and serve as messengers
to deliver useful knowledge from the local data and global model. As the global
model itself is not required to be shared and the local training is conducted
based on an auxiliary model with fewer parameters than the global model, the
proposed approach provides protection for the global model while reducing
communication and computation costs in FL. Extensive experiments show the
effectiveness of the proposed approach compared to several baselines. We have
released the source code at
https://github.com/alibaba/FederatedScope/tree/fedsp/federatedscope/nlp/fedsp.
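
The exchange of soft prompts between server and clients can be sketched as a toy round loop. This is a minimal illustration under stated assumptions, not the released FedSP code: the FedAvg-style weighted averaging, the quadratic toy loss standing in for local training on a frozen auxiliary model, and the names `local_update`, `aggregate`, and `run_rounds` are all hypothetical. The point it shows is that only the small prompt matrix travels over the network, while the global model's weights are never shared.

```python
from typing import List

Prompt = List[List[float]]  # soft prompt: prompt_len x embed_dim matrix


def local_update(prompt: Prompt, target: Prompt, lr: float = 0.5, steps: int = 5) -> Prompt:
    """Toy client step: gradient descent on 0.5 * ||prompt - target||^2.

    Stands in for tuning the soft prompt on local data with a frozen
    auxiliary model; only the prompt values change.
    """
    current = [row[:] for row in prompt]
    for _ in range(steps):
        current = [
            [p - lr * (p - t) for p, t in zip(p_row, t_row)]
            for p_row, t_row in zip(current, target)
        ]
    return current


def aggregate(prompts: List[Prompt], weights: List[float]) -> Prompt:
    """Server step: weighted average of client prompts (assumed FedAvg-style)."""
    total = sum(weights)
    rows, cols = len(prompts[0]), len(prompts[0][0])
    return [
        [sum(w * p[r][c] for p, w in zip(prompts, weights)) / total for c in range(cols)]
        for r in range(rows)
    ]


def run_rounds(
    global_prompt: Prompt,
    client_targets: List[Prompt],
    client_sizes: List[float],
    rounds: int = 10,
) -> Prompt:
    """Broadcast the prompt, collect client updates, and average them each round."""
    for _ in range(rounds):
        updates = [local_update(global_prompt, target) for target in client_targets]
        global_prompt = aggregate(updates, client_sizes)
    return global_prompt


if __name__ == "__main__":
    targets = [[[1.0, 1.0]], [[3.0, 3.0]]]  # each client prefers a different prompt
    print(run_rounds([[0.0, 0.0]], targets, client_sizes=[1.0, 1.0]))
```

With two equally weighted clients, the shared prompt settles near the midpoint of what each client would learn alone, which is the usual behavior of averaging-based aggregation.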

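Participants exchange information through tunable soft prompts, which protects the global model and reduces communication and computation costs in federated learning.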