The ability of large language models (LLMs) to follow instructions is crucial
to real-world applications. Despite recent advances, several studies have
highlighted that LLMs struggle when faced with challenging instructions,
especially those that include complex constraints, hindering their
effectiveness in various tasks. To address this challenge, we introduce
Conifer, a novel instruction-tuning dataset designed to improve the ability of
LLMs to follow multi-level instructions with complex constraints. Using GPT-4,
we curate the dataset through a series of LLM-driven refinement processes to
ensure high quality. We also propose a progressive learning scheme that
emphasizes an easy-to-hard progression and learning from process feedback.
Models trained
with Conifer exhibit remarkable improvements in instruction-following
abilities, especially for instructions with complex constraints. On several
instruction-following benchmarks, our 7B model outperforms the state-of-the-art
open-source 7B models and even exceeds the performance of models 10 times larger
on certain metrics. All the code and the Conifer dataset are available at
this https URL

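By introducing Conifer, a new instruction-tuning dataset, together with a
progressive learning scheme and learning from process feedback, we improve the
ability of large language models (LLMs) to follow multi-level instructions with
complex constraints. On several instruction-following benchmarks, we achieve
significant improvements over existing 7B models and, on certain metrics, even
exceed the performance of models 10 times larger.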

Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models

Recent advances in constrained reinforcement learning (RL) have endowed
reinforcement learning with certain safety guarantees. However, deploying
existing constrained RL algorithms in continuous control tasks with general
hard constraints remains challenging, particularly when those constraints are
non-convex. Inspired by the generalized reduced gradient (GRG)
algorithm, a classical constrained optimization technique, we propose a reduced
policy optimization (RPO) algorithm that combines RL with GRG to address
general hard constraints. RPO partitions actions into basic actions and
nonbasic actions following the GRG method and outputs the basic actions via a
policy network. Subsequently, RPO calculates the nonbasic actions by solving
equations based on equality constraints using the obtained basic actions. The
policy network is then updated by implicitly differentiating nonbasic actions
with respect to basic actions. Additionally, we introduce an action projection
procedure based on the reduced gradient and apply a modified Lagrangian
relaxation technique to ensure inequality constraints are satisfied. To the
best of our knowledge, RPO is the first attempt to introduce GRG into RL as a
way of efficiently handling both equality and inequality hard constraints.
Because RL environments with complex hard constraints are currently lacking, we
develop three new benchmarks: two robotic manipulation tasks and a smart grid
operation control task. On these
benchmarks, RPO achieves better performance than previous constrained RL
algorithms in terms of both cumulative reward and constraint violation. We
believe RPO, along with the new benchmarks, will open up new opportunities for
applying RL to real-world problems with complex constraints.
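
The GRG-style steps described above (partition the action, solve the equality
constraints for the nonbasic part, then differentiate implicitly) can be
illustrated with a minimal sketch. It is not the authors' implementation: the
toy constraints `h`, the quadratic `cost`, and all function names are
hypothetical, the basic action is passed in directly instead of coming from a
policy network, and the action projection and Lagrangian relaxation steps are
left out. The only fact it relies on is the implicit function theorem: if
h(a_B, a_N) = 0, then da_N/da_B = -(dh/da_N)^{-1} dh/da_B.

```python
import numpy as np


def h(a_b, a_n):
    """Toy equality constraints h(a_B, a_N) = 0 (hypothetical, for illustration)."""
    x, y = a_b
    u, v = a_n
    return np.array([u + 0.5 * np.sin(v) - x * y,
                     v + 0.5 * u ** 2 - (x + y)])


def dh_da_n(a_b, a_n):
    """Jacobian of h with respect to the nonbasic actions."""
    u, v = a_n
    return np.array([[1.0, 0.5 * np.cos(v)],
                     [u, 1.0]])


def dh_da_b(a_b, a_n):
    """Jacobian of h with respect to the basic actions."""
    x, y = a_b
    return np.array([[-y, -x],
                     [-1.0, -1.0]])


def solve_nonbasic(a_b, iters=50, tol=1e-12):
    """Recover the nonbasic actions from the basic ones with Newton's method."""
    a_n = np.zeros(2)
    for _ in range(iters):
        residual = h(a_b, a_n)
        if np.linalg.norm(residual) < tol:
            break
        a_n = a_n - np.linalg.solve(dh_da_n(a_b, a_n), residual)
    return a_n


def cost(a_b, a_n):
    """Stand-in objective over the full action (assumed, not from the paper)."""
    return float(np.sum(a_b ** 2) + np.sum(a_n ** 2))


def reduced_gradient(a_b):
    """Gradient of the cost with respect to a_B, with a_N tied to a_B by h = 0."""
    a_n = solve_nonbasic(a_b)
    # Implicit function theorem: d a_N / d a_B = -(dh/da_N)^{-1} dh/da_B
    dan_dab = -np.linalg.solve(dh_da_n(a_b, a_n), dh_da_b(a_b, a_n))
    grad_b = 2.0 * a_b  # partial derivative of cost w.r.t. a_B
    grad_n = 2.0 * a_n  # partial derivative of cost w.r.t. a_N
    return grad_b + dan_dab.T @ grad_n, a_n


a_basic = np.array([0.3, 0.2])  # would come from the policy network
grad, a_nonbasic = reduced_gradient(a_basic)
print("constraint residual:", np.linalg.norm(h(a_basic, a_nonbasic)))
print("reduced gradient:", grad)
```

In a full training loop, this reduced gradient would be propagated back into
the policy parameters by the chain rule, so the network only ever has to
produce the basic actions while the equality constraints hold by construction.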

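Recent advances in constrained reinforcement learning have given RL certain
safety guarantees. We present reduced policy optimization (RPO), an algorithm
that combines RL with GRG to handle continuous control tasks with non-convex
hard constraints. Following the GRG method, RPO partitions actions into basic
and nonbasic actions, outputs the basic actions via a policy network, and
obtains the nonbasic actions by solving the equality constraints. We also
introduce an action projection procedure based on the reduced gradient and
apply a modified Lagrangian relaxation technique to ensure that inequality
constraints are satisfied. To address the current lack of environments with
complex hard constraints, we develop three new benchmarks: two robotic
manipulation tasks and a smart grid operation control task. On these
benchmarks, RPO outperforms previous constrained RL algorithms in terms of both
cumulative reward and constraint violation. We believe that RPO, together with
the new benchmarks, will open up new opportunities for applying RL to
real-world problems with complex constraints.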