Alignment in artificial intelligence pursues the consistency between model
responses and human preferences as well as values. In practice, the
multifaceted nature of human preferences inadvertently introduces what is known
as the "alignment tax" -a compromise where enhancements in alignment within one
objective (e.g.,harmlessness) can diminish performance in others
(e.g.,helpfulness). However, existing alignment techniques are mostly
unidirectional, leading to suboptimal trade-offs and poor flexibility over
various objectives. To navigate this challenge, we argue the prominence of
grounding LLMs with evident preferences. We introduce controllable preference
optimization (CPO), which explicitly specifies preference scores for different
objectives, thereby guiding the model to generate responses that meet the
requirements. Our experimental analysis reveals that the aligned models can
provide responses that match various preferences among the "3H" (helpfulness,
honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and
alignment goals, we surpass baseline methods in aligning with single
objectives, hence mitigating the impact of the alignment tax and achieving
Pareto improvements in multi-objective alignment.

通过引入可控偏好优化（CPO），我们可以实现模型响应满足不同目标需求的对齐模型，并在多目标对齐中获得 Pareto 改进。