Large language model (LLM) alignment is widely used and studied to keep LLMs
from producing unhelpful or harmful responses. However, the lengthy training
process and predefined preference bias hinder adaptation to diverse online
human preferences. To this end, this paper proposes an alignment framework,
called Reinforcement Learning with Human Behavior (RLHB), to align LLMs by
directly leveraging real online human behaviors. Within a generative
adversarial framework, the generator is trained to respond in line with the
expected human behavior, while the discriminator tries to verify whether
triplets of query, response, and human behavior come from real online
environments.
Behavior modeling in natural-language form and the multi-model joint training
mechanism enable active and sustainable online alignment. Both human and
automatic evaluations confirm the effectiveness of the proposed method.
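
The triplet structure described above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not specify any text templates, so the `Triplet` class, the `generator_prompt` and `discriminator_input` functions, and the 1/0 labels are hypothetical and are not the authors' implementation. The sketch only shows how a behavior expressed in natural language might condition the generator and how real and generated triplets might be labeled for a discriminator.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one (query, response, behavior) triplet."""
    query: str
    response: str
    behavior: str  # behavior in natural-language form, e.g. "the user liked it"


def generator_prompt(query: str, expected_behavior: str) -> str:
    """Build a generator input conditioned on the expected behavior (assumed template)."""
    return (
        f"Query: {query}\n"
        f"Expected user behavior: {expected_behavior}\n"
        f"Response:"
    )


def discriminator_input(triplet: Triplet) -> str:
    """Serialize a triplet as plain text for a discriminator (assumed template)."""
    return (
        f"Query: {triplet.query}\n"
        f"Response: {triplet.response}\n"
        f"User behavior: {triplet.behavior}"
    )


def build_discriminator_batch(
    real: List[Triplet], generated: List[Triplet]
) -> List[Tuple[str, int]]:
    """Label logged triplets 1 and generator-produced triplets 0."""
    batch = [(discriminator_input(t), 1) for t in real]
    batch += [(discriminator_input(t), 0) for t in generated]
    return batch
```

In an adversarial setup of this kind, the generator would be rewarded when the discriminator scores its triplets as real; that training loop is omitted here because the abstract gives no details about it.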

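This paper proposes an alignment framework, Reinforcement Learning with Human Behavior (RLHB), that aligns large language models by directly leveraging real online human behaviors. Within a generative adversarial framework, the generator is trained to respond according to expected human behavior, while the discriminator verifies whether triplets of query, response, and human behavior come from real online environments. Behavior modeling in natural-language form and a multi-model joint training mechanism enable active and sustainable online alignment. Human and automatic evaluations confirm the effectiveness of the method.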

The Real, the Better: Aligning Large Language Models with Online Human Behaviors

Symbolic Music Alignment is the process of matching performed MIDI notes to
corresponding score notes. In this paper, we introduce a reinforcement learning
(RL)-based online symbolic music alignment technique. The RL agent, an
attention-based neural network, iteratively estimates the current score
position from local score and performance contexts. For this symbolic alignment
task, environment states can be sampled exhaustively and the reward is dense,
which makes it straightforward to formulate the task as a simplified offline
RL problem. We evaluate the trained agent in three ways: first, in its capacity
to identify
correct score positions for sampled test contexts; second, as the core
technique of a complete algorithm for symbolic online note-wise alignment; and
finally, as a real-time symbolic score follower. We further investigate the
pitch-based score and performance representations used as the agent's inputs.
To this end, we develop a second model, a two-step Dynamic Time Warping
(DTW)-based offline alignment algorithm leveraging the same input
representation. The proposed model outperforms a state-of-the-art reference
model of offline symbolic music alignment.
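
For readers unfamiliar with Dynamic Time Warping, the following is a minimal sketch of textbook DTW over two MIDI pitch sequences. It is not the two-step algorithm from the paper, whose details the abstract does not give. The function name `dtw_align`, the 0/1 pitch-mismatch cost, and the step pattern (diagonal, vertical, horizontal) are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def dtw_align(
    score: Sequence[int], performance: Sequence[int]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two pitch sequences with plain DTW.

    Returns the total cost and the warping path as
    (score_index, performance_index) pairs.
    """
    n, m = len(score), len(performance)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # cost[i][j] = minimal cost of aligning score[:i] with performance[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local distance: 0 if the pitches match, 1 otherwise.
            local = 0.0 if score[i - 1] == performance[j - 1] else 1.0
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in the score only
                cost[i][j - 1],      # advance in the performance only
            )

    # Backtrack from the end to recover the warping path.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return cost[n][m], path
```

As a usage example, `dtw_align([60, 62, 64, 65], [60, 62, 62, 64, 65])` aligns a score with a performance that repeats one note: the total cost is 0.0 and score index 1 is matched to performance indices 1 and 2.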

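This study introduces a reinforcement learning-based technique for online symbolic music alignment. It uses an attention-based neural network to estimate the score position, is evaluated in three ways, and outperforms a state-of-the-art model of offline symbolic music alignment.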

Online Symbolic Music Alignment with Offline Reinforcement Learning

Online alignment in machine translation refers to the task of aligning a
target word to a source word when the target sequence has only been partially
decoded. Good online alignments facilitate important applications such as
lexically constrained translation where user-defined dictionaries are used to
inject lexical constraints into the translation model. We propose a novel
posterior alignment technique that is truly online in its execution and
superior in terms of alignment error rates compared to existing methods. Our
proposed inference technique jointly considers alignment and token
probabilities in a principled manner and can be seamlessly integrated within
existing constrained beam-search decoding algorithms. On five language pairs,
including two distant language pairs, we achieve a consistent drop in alignment
error rates. When deployed on seven lexically constrained translation tasks, we
achieve significant improvements in BLEU specifically around the constrained
positions.
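
The abstract reports alignment error rates without defining the metric. The sketch below computes the alignment error rate (AER) as it is commonly defined in the word-alignment literature, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the set of sure gold links, and P the set of possible gold links (with S contained in P). Whether the paper's gold data separates sure from possible links is an assumption; the function name and the link representation as (source_index, target_index) pairs are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

Link = Tuple[int, int]  # (source_index, target_index), an assumed representation


def alignment_error_rate(
    predicted: Iterable[Link],
    sure: Iterable[Link],
    possible: Optional[Iterable[Link]] = None,
) -> float:
    """Compute AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    If no possible links are given, P defaults to S.
    """
    a: Set[Link] = set(predicted)
    s: Set[Link] = set(sure)
    # Sure links always count as possible links as well.
    p: Set[Link] = s | set(possible or ())
    if not a and not s:
        raise ValueError("AER is undefined when both link sets are empty")
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

For example, with predicted links `{(0, 0), (1, 1), (2, 3)}`, sure links `{(0, 0), (1, 1), (2, 2)}`, and the extra possible link `(2, 3)`, the score is 1 - (2 + 3) / (3 + 3), or about 0.167. Lower values indicate better alignments.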

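This paper presents an online alignment technique for machine translation that helps users inject custom dictionaries into the translation model and can be integrated with existing constrained search techniques, effectively addressing the alignment problem in machine translation. Experiments on 5 language pairs and 7 translation tasks show significantly lower alignment error rates and corresponding improvements in BLEU.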