Safety and trustworthiness are indispensable requirements for applying AI systems based on large language models (LLMs) in real-world applications. This paper formulates human value alignment as a language model policy optimization problem that maximizes reward under a safety constraint, and then proposes an algorithm called Stepwise Alignment for Constrained Policy Optimization (SACPO). A key idea behind SACPO, supported by theory, is that the optimal policy incorporating both reward and safety can be obtained directly from a reward-aligned policy. Based on this idea, SACPO aligns LLMs with each metric in a stepwise manner while leveraging simple yet powerful alignment algorithms such as direct preference optimization (DPO). SACPO offers several benefits: simplicity, stability, computational efficiency, and flexibility in the choice of algorithms and datasets. Under mild assumptions, our theoretical analysis provides upper bounds on near-optimality and safety constraint violation. Our experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness.
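
The key idea, that a policy balancing reward and safety can be reached from a reward-aligned policy, can be illustrated on a toy problem. The sketch below is not the paper's implementation. It assumes the standard KL-regularized objective used by DPO-style methods, under which the optimal policy is the reference policy reweighted by an exponential of the score, and it treats the Lagrange multiplier of the safety constraint as a fixed, given number. The abstract states none of these details, so the names `pi_ref`, `reward`, `safety`, `beta`, and `lam`, and all numeric values, are hypothetical.

```python
import math


def tilt(base, scores, scale):
    """Return the distribution proportional to base[i] * exp(scale * scores[i])."""
    weights = [p * math.exp(scale * s) for p, s in zip(base, scores)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy setup: four candidate responses to a single prompt.
pi_ref = [0.4, 0.3, 0.2, 0.1]     # reference policy (illustrative)
reward = [1.0, 2.0, 0.5, 3.0]     # helpfulness scores (illustrative)
safety = [0.5, -1.0, 1.0, -2.0]   # safety scores (illustrative)
beta = 0.5                        # KL-regularization strength (assumed)
lam = 0.8                         # Lagrange multiplier, assumed fixed and known

# Step 1: align the reference policy with the reward only.
pi_reward = tilt(pi_ref, reward, 1.0 / beta)

# Step 2: align the reward-aligned policy with the safety score.
pi_stepwise = tilt(pi_reward, safety, lam / beta)

# Direct route: align the reference policy with reward + lam * safety at once.
combined = [r + lam * g for r, g in zip(reward, safety)]
pi_direct = tilt(pi_ref, combined, 1.0 / beta)

# Under the assumed closed form, the two routes give the same policy.
max_gap = max(abs(a - b) for a, b in zip(pi_stepwise, pi_direct))
print(f"max difference between stepwise and direct policies: {max_gap:.2e}")
```

In this closed-form setting the two routes agree up to floating-point error, because the two exponential reweightings multiply into one. In practice each step would be carried out approximately by an alignment algorithm such as DPO on preference data, and the multiplier would have to be chosen so that the safety constraint holds.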

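This paper formulates human value alignment for large language models (LLMs) as a language model policy optimization problem that maximizes reward under a safety constraint, and proposes an algorithm called SACPO. Using simple yet powerful alignment algorithms such as direct preference optimization, SACPO aligns LLMs with each metric step by step, and offers simplicity, stability, computational efficiency, and flexibility in the choice of algorithms and datasets. Under mild assumptions, our theoretical analysis provides upper bounds on near-optimality and safety constraint violation. Experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness.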

Stepwise Alignment for Constrained Language Model Policy Optimization