Human preference alignment is critical in building powerful and reliable
large language models (LLMs). However, current methods either ignore the
multi-dimensionality of human preferences (e.g. helpfulness and harmlessness)
or struggle with the complexity of managing multiple reward models. To address
these issues, we propose Sequential Preference Optimization (SPO), a method
that sequentially fine-tunes LLMs to align with multiple dimensions of human
preferences. SPO avoids explicit reward modeling, directly optimizing the
models to align with nuanced human preferences. We theoretically derive the
closed-form optimal SPO policy and loss function. Gradient analysis is
conducted to show how SPO manages to fine-tune the LLMs while maintaining
alignment on previously optimized dimensions. Empirical results on LLMs of
different sizes and multiple evaluation datasets demonstrate that SPO
successfully aligns LLMs across multiple dimensions of human preferences and
significantly outperforms the baselines.
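
As background for what "avoiding explicit reward modeling" can look like in practice, here is a minimal sketch of the standard DPO pairwise loss, in which the reward is implicit in the log-probability ratio between the policy and a frozen reference model. This is an illustrative assumption: the abstract does not state the SPO loss itself, and the function name, argument names, and the default beta below are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (not the SPO loss).

    Each argument is a sequence log-probability log pi(y | x); the "ref_"
    values come from a frozen reference model. beta is an assumed default.
    """
    # Implicit reward of a response: beta * log(pi(y|x) / pi_ref(y|x)).
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy matches the reference and falls as the policy raises the preferred response relative to the rejected one. A sequential scheme would presumably apply an objective of this kind one preference dimension at a time, but how SPO preserves earlier dimensions is not specified in the abstract.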

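Using a sequential optimization approach, this study proposes a method for aligning large language models with multi-dimensional human preferences. The method avoids explicit reward modeling and achieves alignment across multiple preference dimensions, and experiments show that it outperforms the baseline models.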

SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling

Large Language Models (LLMs) rely on Human Preference Alignment (HPA) to
ensure the generation of safe content. Due to the heavy cost associated with
fine-tuning, fine-tuning-free methods have emerged, typically modifying LLM
decoding with external auxiliary methods. However, these methods do not
fundamentally enhance the LLM itself. In this paper, we rethink the derivation
of DPO and, working in the reverse direction, build an instant scorer from
the states of the LLM before and after In-context Learning (ICL). Accordingly,
we propose a novel approach called In-Context Direct Preference Optimization
(ICDPO). It enables LLMs to borrow the HPA capabilities from superior LLMs with
ICL, generating well-aligned responses as estimated by the aforementioned
instant scorer, thereby enhancing the final performance. ICDPO can be further
enhanced with a two-stage retriever and an upgraded scorer, both offering
benefits. Extensive experiments show its effectiveness: it outperforms two
fine-tuning-free baselines and is competitive with
SFT + LoRA. We also conduct detailed analyses to offer comprehensive
insights into ICDPO.
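
The scoring idea can be sketched as follows, under a stated assumption: the abstract only says the scorer uses the LLM's states before and after ICL, so the specific form used here (the sequence log-probability with demonstrations in the prompt minus the log-probability without them, by analogy with DPO's implicit reward) is an illustrative guess. The function names and the candidate tuple layout are hypothetical.

```python
def instant_score(logp_with_icl, logp_without_icl):
    """Assumed scorer: log pi(y | demos, x) - log pi(y | x).

    A larger value means the in-context demonstrations made the response
    more likely, which is taken here as a proxy for better alignment.
    """
    return logp_with_icl - logp_without_icl


def pick_best(candidates):
    """Return the response with the highest instant score.

    candidates: list of (response, logp_with_icl, logp_without_icl) tuples,
    where the log-probabilities are assumed to be computed elsewhere by
    scoring each sampled response with the same LLM twice.
    """
    best = max(candidates, key=lambda c: instant_score(c[1], c[2]))
    return best[0]
```

For example, given several sampled responses to one prompt, `pick_best` keeps the one whose likelihood rose most once demonstrations were added. No model weights are updated, consistent with the fine-tuning-free setting the abstract describes.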

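By rethinking the derivation of DPO, this work builds an instant scorer from the states of the LLM before and after ICL and proposes a new method called ICDPO. The method lets an LLM borrow the HPA capability of superior LLMs through ICL and generate responses that the instant scorer estimates to be well aligned, thereby improving final performance.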

ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization

Human preference alignment is a crucial training step to improve the
interaction quality of large language models (LLMs). Existing alignment methods
depend on manually annotated preference data to guide the LLM optimization
directions. However, in practice, continuously updating LLMs creates a
distribution gap between model-generated samples and human-preferred responses,
which hinders model fine-tuning efficiency. To mitigate this issue, previous
methods require additional preference annotation on generated samples to adapt
to the shifted distribution, which consumes a large amount of annotation
resources. Targeting more efficient human preference optimization, we propose
an adversarial preference optimization (APO) framework, where the LLM agent and
the preference model update alternately via a min-max game. Without
additional annotation, our APO method can self-adapt to the
generation distribution gap through the adversarial learning process. In
experiments, we empirically verify the effectiveness of APO in improving LLM
helpfulness and harmlessness compared with rejection sampling baselines.

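Human preference alignment is an important training step for improving the interaction quality of large language models. We propose an adversarial preference optimization (APO) framework in which the LLM agent and the preference model update alternately through a min-max game, adapting to the generation distribution gap. Experiments demonstrate the effectiveness of APO in improving LLM helpfulness and harmlessness.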