Large language models (LLMs) can be significantly improved by aligning them
with human preferences, an approach known as reinforcement learning from human
feedback (RLHF). However, the cost of fine-tuning an LLM is prohibitive for
many users.
Due to their ability to bypass LLM finetuning, tokenwise reward-guided text
generation (RGTG) methods have recently been proposed. They use a reward model
trained on full sequences to score partial sequences during tokenwise
decoding, in a bid to steer the generation towards sequences with high rewards.
However, these methods have so far been only heuristically motivated and poorly
analyzed. In this work, we show that reward models trained on full sequences
are not compatible with scoring partial sequences. To alleviate this issue, we
propose to explicitly train a Bradley-Terry reward model on partial sequences,
and autoregressively sample from the implied tokenwise policy at decoding
time. We study the properties of this reward model and the implied policy. In
particular, we show that this policy is proportional to the ratio of two
distinct RLHF policies. We show that our simple approach outperforms previous
RGTG methods and achieves performance similar to strong offline baselines, but
without large-scale LLM finetuning.
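
The two ingredients named above, a Bradley-Terry loss on partial sequences and
sampling from a reward-reweighted next-token distribution, can be sketched as
follows. This is a minimal illustration, not the authors' implementation: the
function names, the `beta` temperature, and the exponential reweighting form
(base probability times exp(beta * reward)) are assumptions, and the rewards
are plain floats standing in for the output of a trained reward model.

```python
import math


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen partial sequence is preferred.

    Under the Bradley-Terry model, P(chosen > rejected) = sigmoid(r_c - r_r).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def guided_next_token_probs(base_logprobs, partial_rewards, beta=1.0):
    """Reweight base next-token log-probs by a partial-sequence reward.

    ASSUMPTION: the guided policy is taken to be proportional to
    base_prob(token) * exp(beta * reward(prefix + token)).
    """
    scores = {
        token: logprob + beta * partial_rewards[token]
        for token, logprob in base_logprobs.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    unnormalized = {t: math.exp(s - max_score) for t, s in scores.items()}
    total = sum(unnormalized.values())
    return {t: v / total for t, v in unnormalized.items()}


# Hypothetical two-token vocabulary: the base model is indifferent,
# while the partial-sequence reward prefers "good".
base = {"good": math.log(0.5), "bad": math.log(0.5)}
rewards = {"good": 1.0, "bad": -1.0}
probs = guided_next_token_probs(base, rewards, beta=1.0)
```

In this toy case the guided distribution shifts mass towards the token whose
continuation scores higher, and setting `beta` to zero recovers the base
distribution.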

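Large language models (LLMs) can be significantly improved through alignment with human preferences, the so-called reinforcement learning from human feedback (RLHF). However, for many users the cost of fine-tuning an LLM is unacceptable. Recently proposed tokenwise reward-guided text generation (RGTG) methods bypass LLM fine-tuning: they use a reward model trained on full sequences to score partial sequences during tokenwise decoding, in order to steer generation towards high-reward sequences. However, these methods have so far been only heuristically motivated and poorly analyzed. In this work, we show that reward models trained on full sequences are not compatible with scoring partial sequences. To alleviate this issue, we propose explicitly training a Bradley-Terry reward model on partial sequences and autoregressively sampling from the implied tokenwise policy at decoding time. We study the properties of this reward model and the implied policy. In particular, we show that this policy is proportional to the ratio of two distinct RLHF policies. We show that our simple approach outperforms previous RGTG methods and achieves performance similar to strong offline baselines without large-scale LLM fine-tuning.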

A Critical Study of Tokenwise Reward-Guided Text Generation

A Critical Look At Tokenwise Reward-Guided Text Generation

Large language models (LLMs) have demonstrated impressive capabilities in
mathematical problem solving, particularly in single-turn question-answering
formats. However, real-world scenarios often involve mathematical question
answering that requires multi-turn or interactive information exchanges, and
the performance of LLMs on these tasks is still underexplored. This paper
introduces MathChat, a comprehensive benchmark specifically designed to
evaluate LLMs across a broader spectrum of mathematical tasks. These tasks are
structured to assess the models' abilities in multi-turn interactions and
open-ended generation. We evaluate the performance of various state-of-the-art
(SOTA) LLMs on the MathChat benchmark and observe that, while these models
excel in single-turn question answering, they significantly underperform in
more complex scenarios that require sustained reasoning and dialogue
understanding. To address these limitations of existing LLMs on multi-turn and
open-ended tasks, we develop MathChat sync, a synthetic dialogue-based math
dataset for LLM finetuning that focuses on improving models' interaction and
instruction-following capabilities in conversations. Experimental results
emphasize the need for training LLMs with diverse, conversational
instruction-tuning datasets like MathChat sync. We believe this work outlines
one promising direction for improving the multi-turn mathematical reasoning
abilities of LLMs, thus pushing forward the development of LLMs that are more
adept at interactive mathematical problem solving and real-world applications.

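This paper introduces MathChat, a benchmark specifically designed to evaluate large language models on a broader range of mathematical tasks, and observes that these models perform well in single-turn question answering but degrade significantly in complex scenarios that require sustained reasoning and dialogue understanding. The authors develop MathChat sync, a synthetic dialogue-based math dataset for improving models' dialogue and instruction-following capabilities, and the experimental results emphasize the need to train large language models with diverse conversational instruction-tuning datasets like MathChat sync. The authors believe this work points to a promising direction for improving the multi-turn mathematical reasoning abilities of large language models, advancing the development of models that are more adept at interactive mathematical problem solving and real-world applications.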

MathChat: A Benchmark Evaluation of Mathematical Reasoning and Instruction Following in Multi-Turn Interactions

MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions

Recent end-to-end approaches have shown promise in extending large language
models (LLMs) to speech inputs, but face limitations in directly assessing and
optimizing alignment quality and fail to achieve fine-grained alignment due to
speech-text length mismatch. We introduce BLSP-KD, a novel approach for
Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which
addresses these limitations through two key techniques. First, it optimizes
speech-text alignment by minimizing the divergence between the LLM's next-token
prediction distributions for speech and text inputs using knowledge
distillation. Second, it employs a continuous-integrate-and-fire strategy to
segment speech into tokens that correspond one-to-one with text tokens,
enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new
adaptation method supporting LLM finetuning for speech inputs under knowledge
distillation. Quantitative evaluation shows that BLSP-KD outperforms previous
end-to-end baselines and cascaded systems with a comparable number of parameters,
facilitating general instruction-following capabilities for LLMs with speech
inputs. This approach provides new possibilities for extending LLMs to spoken
language interactions.
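
The first technique, matching the LLM's next-token distributions for speech
and text inputs, can be illustrated with a small divergence computation. This
is a hedged sketch rather than the BLSP-KD training code: it assumes the
divergence is KL(text || speech) averaged over positions, that speech and text
positions are already aligned one-to-one, and that logits arrive as plain
Python lists. The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return [x - log_sum_exp for x in logits]


def kl_divergence_from_logits(teacher_logits, student_logits):
    """KL(teacher || student) between two next-token distributions."""
    teacher_logp = log_softmax(teacher_logits)
    student_logp = log_softmax(student_logits)
    return sum(
        math.exp(tp) * (tp - sp) for tp, sp in zip(teacher_logp, student_logp)
    )


def distillation_loss(text_logits_per_step, speech_logits_per_step):
    """Average KL over aligned positions.

    ASSUMPTION: the text-input distribution acts as the teacher and the
    speech-input distribution as the student.
    """
    if not text_logits_per_step:
        raise ValueError("at least one position is required")
    if len(text_logits_per_step) != len(speech_logits_per_step):
        raise ValueError("speech and text positions must align one-to-one")
    per_step = [
        kl_divergence_from_logits(text, speech)
        for text, speech in zip(text_logits_per_step, speech_logits_per_step)
    ]
    return sum(per_step) / len(per_step)
```

The length check reflects why one-to-one segmentation matters here: a
position-wise divergence is only defined when each speech token has a
matching text token.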

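Using knowledge distillation, BLSP-KD optimizes speech-text alignment quality and achieves fine-grained alignment through two key techniques. It also introduces PLoRA, an adaptation method for LLMs, and quantitative evaluation demonstrates the advantages of BLSP-KD in extending LLMs to spoken language interactions.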