Effectively aligning Large Language Models (LLMs) with human-centric values
while preventing the degradation of abilities acquired through Pre-training and
Supervised Fine-tuning (SFT) poses a central challenge in Reinforcement
Learning from Human Feedback (RLHF). In this paper, we first discover that
interpolating RLHF and SFT model parameters can adjust the trade-off between
human preference and basic capabilities, thereby reducing the alignment tax at
the cost of alignment reward. Inspired by this, we propose integrating the RL
policy and SFT models at each optimization step in RLHF to continuously
regulate the training direction, introducing the Online Merging Optimizer.
Specifically, we merge gradients with the parameter differences between SFT and
pretrained models, effectively steering the gradient towards maximizing rewards
in the direction of SFT optimization. We demonstrate that our optimizer works
well with different LLM families, such as Qwen and LLaMA, across various model
sizes ranging from 1.8B to 8B, various RLHF algorithms like DPO and KTO, and
existing model merging methods. It significantly enhances alignment reward
while mitigating alignment tax, achieving higher overall performance across 14
benchmarks.
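
The parameter interpolation mentioned above can be illustrated with a minimal sketch. It assumes plain linear interpolation with a single coefficient `alpha`, and represents each checkpoint as a dictionary of flat float lists. The function name `interpolate_params` and this data layout are hypothetical, chosen for illustration, and this is not the paper's Online Merging Optimizer itself.

```python
from typing import Dict, List

# Assumption: a checkpoint is a mapping from parameter name to a flat list of floats.
Params = Dict[str, List[float]]


def interpolate_params(sft: Params, rlhf: Params, alpha: float) -> Params:
    """Return (1 - alpha) * sft + alpha * rlhf for every parameter.

    alpha = 0 gives back the SFT model, alpha = 1 the RLHF model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if sft.keys() != rlhf.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Params = {}
    for name in sft:
        if len(sft[name]) != len(rlhf[name]):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two checkpoints.
        merged[name] = [
            (1.0 - alpha) * s + alpha * r
            for s, r in zip(sft[name], rlhf[name])
        ]
    return merged


if __name__ == "__main__":
    sft_model = {"w": [0.0, 2.0]}
    rlhf_model = {"w": [1.0, 4.0]}
    print(interpolate_params(sft_model, rlhf_model, 0.5))
```

Sweeping `alpha` from 0 to 1 moves the merged model from the SFT checkpoint to the RLHF checkpoint, which is one way to picture the trade-off between basic capabilities and alignment reward described in the abstract.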

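The Online Merging Optimizer continuously regulates the training direction during reinforcement learning from human feedback, yielding strong LLM performance and a significant increase in alignment reward while reducing the alignment tax.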

Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment

Point cloud registration is a common step in many 3D computer vision tasks
such as object pose estimation, where a 3D model is aligned to an observation.
Classical registration methods generalize well to novel domains but fail when
given a noisy observation or a bad initialization. Learning-based methods, in
contrast, are more robust but lack generalization capacity. We propose to
consider iterative point cloud registration as a reinforcement learning task
and, to this end, present a novel registration agent (ReAgent). We employ
imitation learning to initialize its discrete registration policy based on a
steady expert policy. Integration with policy optimization, based on our
proposed alignment reward, further improves the agent's registration
performance. We compare our approach to classical and learning-based
registration methods on both ModelNet40 (synthetic) and ScanObjectNN (real
data) and show that our ReAgent achieves state-of-the-art accuracy. The
lightweight architecture of the agent, moreover, reduces inference time
compared to related approaches. In addition, we apply our method to the
object pose estimation task on real data (LINEMOD), outperforming
state-of-the-art pose refinement approaches.
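
To make the idea of iterative registration with discrete actions concrete, here is a toy sketch. It is not ReAgent: a greedy search stands in for the learned policy, only translations are handled (no rotations), and the step set `STEPS` and the `alignment_reward` values are hypothetical choices made for illustration.

```python
import math

# Assumption: hypothetical discrete step sizes available along each axis.
STEPS = [-0.1, -0.01, 0.0, 0.01, 0.1]


def mean_nn_distance(source, target):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def translate(points, offset):
    """Shift every point by the given (x, y, z) offset."""
    return [tuple(c + o for c, o in zip(p, offset)) for p in points]


def alignment_reward(dist_before, dist_after):
    """Toy reward: +1 if the step improved alignment, -1 if it hurt, else 0."""
    if dist_after < dist_before:
        return 1.0
    if dist_after > dist_before:
        return -1.0
    return 0.0


def greedy_register(source, target, iterations=20):
    """Greedy stand-in for a learned policy: per axis, take the best discrete step."""
    total_reward = 0.0
    for _ in range(iterations):
        for axis in range(3):
            before = mean_nn_distance(source, target)
            best_step, best_dist = 0.0, before
            for step in STEPS:
                offset = [0.0, 0.0, 0.0]
                offset[axis] = step
                dist = mean_nn_distance(translate(source, offset), target)
                if dist < best_dist:
                    best_step, best_dist = step, dist
            # Apply the chosen step and score it with the toy reward.
            offset = [0.0, 0.0, 0.0]
            offset[axis] = best_step
            source = translate(source, offset)
            total_reward += alignment_reward(before, best_dist)
    return source, total_reward


if __name__ == "__main__":
    target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    source = translate(target, (0.3, -0.2, 0.05))
    aligned, reward = greedy_register(source, target)
    print(mean_nn_distance(aligned, target), reward)
```

In a learned agent, the inner search over `STEPS` would be replaced by a policy network that predicts the step for each axis from the two point clouds, and the reward signal would be used during policy optimization rather than summed for reporting.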

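This paper proposes ReAgent, an iterative point cloud registration algorithm based on reinforcement learning. By introducing a new alignment reward function for policy optimization on top of an imitation-learned policy, it significantly improves registration performance. Experiments show strong performance on both the ModelNet40 and ScanObjectNN datasets, and on a real-world object pose estimation task (the LINEMOD dataset) it achieves more accurate results than existing algorithms.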