Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model while retaining the expertise of the originals. However, current approaches often overlook safety alignment during merging, which can produce highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques and show that existing methods transfer not only domain expertise but also misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating the generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments demonstrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.

Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
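The two-step idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a task-arithmetic-style merge (base weights plus weighted task vectors) and plain gradient descent on the merge coefficients as a stand-in for a data-aware merging technique, and it replaces the synthetic data generation step with two hand-written examples. All names (`merge`, `fit_coeffs`, `domain_data`, `safety_data`) are hypothetical.

```python
# Toy illustration only: 2-parameter "models", squared-error "skills".
# Assumption: merging = base + sum_k coeff_k * task_vector_k, with the
# coefficients fitted by gradient descent on the supplied datasets.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def merge(base, task_vectors, coeffs):
    """Return base + sum_k coeffs[k] * task_vectors[k]."""
    merged = list(base)
    for lam, tau in zip(coeffs, task_vectors):
        for i, t in enumerate(tau):
            merged[i] += lam * t
    return merged

def mse(weights, data):
    """Mean squared error of a linear model on (x, y) pairs."""
    return sum((dot(weights, x) - y) ** 2 for x, y in data) / len(data)

def fit_coeffs(base, task_vectors, datasets, steps=500, lr=0.1):
    """Minimize the sum of per-dataset MSEs over the merge coefficients."""
    coeffs = [0.5] * len(task_vectors)  # start from a uniform average
    for _ in range(steps):
        w = merge(base, task_vectors, coeffs)
        grads = [0.0] * len(coeffs)
        for data in datasets:
            for x, y in data:
                err = dot(w, x) - y
                for k, tau in enumerate(task_vectors):
                    # d(err^2)/d(coeff_k) = 2 * err * (tau_k . x)
                    grads[k] += 2.0 * err * dot(tau, x) / len(data)
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs

base = [0.0, 0.0]
task_vectors = [
    [1.0, -1.0],  # domain expert: helps the domain skill, hurts safety
    [0.0, 1.0],   # aligned model: helps safety only
]

# Step (i) stand-in: tiny hand-written "domain" and "safety" datasets.
domain_data = [([1.0, 0.0], 1.0)]  # the domain skill depends on weight 0
safety_data = [([0.0, 1.0], 1.0)]  # the safety skill depends on weight 1

# Step (ii): fit the coefficients with and without the safety data.
coeffs_domain_only = fit_coeffs(base, task_vectors, [domain_data])
coeffs_both = fit_coeffs(base, task_vectors, [domain_data, safety_data])

merged_domain_only = merge(base, task_vectors, coeffs_domain_only)
merged_both = merge(base, task_vectors, coeffs_both)

print("safety loss, domain data only:", mse(merged_domain_only, safety_data))
print("safety loss, domain + safety data:", mse(merged_both, safety_data))
print("domain loss, domain + safety data:", mse(merged_both, domain_data))
```

In this toy setup, fitting on domain data alone leaves the merged weights with a high safety loss, because the domain expert's task vector carries a harmful component that nothing in the objective penalizes. Adding the safety dataset to the same objective lets the optimizer raise the weight of the aligned model, so both losses can be driven down together.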