Large Language Models (LLMs) have demonstrated great potential in
Conversational Recommender Systems (CRS). However, the application of LLMs to
CRS has exposed a notable discrepancy in behavior between LLM-based CRS and
human recommenders: LLMs often appear inflexible and passive, frequently
rushing to complete the recommendation task without sufficient inquiry. This
behavior discrepancy can lead to decreased accuracy in recommendations and
lower user satisfaction. Despite its importance, existing CRS research has not
studied how to measure this behavior discrepancy. To fill this gap, we
propose Behavior Alignment, a new evaluation metric that measures how
consistent the recommendation strategies of an LLM-based CRS are with those of
human recommenders. Our experimental results show that the new metric is better
aligned with human preferences and can better differentiate how systems perform
than existing evaluation metrics. As Behavior Alignment requires explicit and
costly human annotations on the recommendation strategies, we also propose a
classification-based method to implicitly measure Behavior Alignment based
on the responses. The evaluation results confirm the robustness of the method.
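
The abstract does not define the metric, so the following is only a minimal sketch of one way such a consistency score could be computed: the fraction of dialogue turns at which the system's annotated strategy matches the human recommender's. The function name `strategy_agreement` and the strategy labels (`ask`, `chat`, `recommend`) are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def strategy_agreement(system: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of turns where the system's strategy label equals the human's.

    Assumption: both sequences are turn-aligned annotations of the same dialogue.
    """
    if len(system) != len(human):
        raise ValueError("label sequences must have the same length")
    if not system:
        raise ValueError("need at least one turn")
    matches = sum(s == h for s, h in zip(system, human))
    return matches / len(system)


# Hypothetical annotations: the human asks twice before recommending,
# while the system rushes to recommend.
human_labels = ["ask", "ask", "chat", "recommend"]
system_labels = ["recommend", "ask", "recommend", "recommend"]
score = strategy_agreement(system_labels, human_labels)
```

Here two of the four turns agree, so `score` is 0.5. A system that recommends at every turn is penalized at exactly the turns where the human chose to inquire, which is the kind of passivity the abstract describes.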

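LLM-based conversational recommender systems behave differently from human recommenders. This work proposes Behavior Alignment, an evaluation metric that compares a system's recommendation strategies with those of human recommenders and better reflects system performance. It also proposes a classification-based method for measuring the metric implicitly, whose robustness is confirmed by evaluation.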

Behavior Alignment: A New Perspective of Evaluating LLM-based Conversational Recommendation Systems

Designing reward functions for efficiently guiding reinforcement learning
(RL) agents toward specific behaviors is a complex task, since it requires
identifying reward structures that are not sparse
and that avoid inadvertently inducing undesirable behaviors. Naively modifying
the reward structure to offer denser and more frequent feedback can lead to
unintended outcomes and promote behaviors that are not aligned with the
designer's intended goal. Although potential-based reward shaping is often
suggested as a remedy, we systematically investigate settings where deploying
it often significantly impairs performance. To address these issues, we
introduce a new framework that uses a bi-level objective to learn
behavior alignment reward functions. These functions integrate auxiliary
rewards reflecting a designer's heuristics and domain knowledge with the
environment's primary rewards. Our approach automatically determines the most
effective way to blend these types of feedback, thereby enhancing robustness
against heuristic reward misspecification. Remarkably, it can also adapt an
agent's policy optimization process to mitigate suboptimalities resulting from
limitations and biases inherent in the underlying RL algorithms. We evaluate
our method's efficacy on a diverse set of tasks, from small-scale experiments
to high-dimensional control challenges. We investigate heuristic auxiliary
rewards of varying quality -- some of which are beneficial and others
detrimental to the learning process. Our results show that our framework offers
a robust and principled way to integrate designer-specified heuristics. It not
only addresses key shortcomings of existing approaches but also consistently
leads to high-performing solutions, even when given misaligned or
poorly-specified auxiliary reward functions.
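
To illustrate the bi-level idea, here is a toy sketch under strong assumptions: a two-armed bandit, a softmax policy trained by exact policy gradient in the inner loop, and a simple grid search over a single blend weight `w` in the outer loop. The abstract does not say how the outer level is optimized, so the grid search is a stand-in rather than the authors' method, and every name here (`inner_loop`, `outer_loop`, the reward vectors) is hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected(probs, reward):
    """Expected reward of a bandit policy."""
    return sum(p * r for p, r in zip(probs, reward))


def inner_loop(reward, steps=50, lr=0.5):
    """Inner level: exact policy-gradient ascent on the given (blended) reward."""
    logits = [0.0] * len(reward)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = expected(probs, reward)
        # For a softmax policy: d E[r] / d logit_i = p_i * (r_i - E[r])
        logits = [l + lr * p * (r - baseline)
                  for l, p, r in zip(logits, probs, reward)]
    return softmax(logits)


def outer_loop(r_env, r_aux, weights):
    """Outer level: pick the blend weight whose learned policy earns the
    most PRIMARY reward (the auxiliary reward is never used for scoring)."""
    best_w, best_return = None, -math.inf
    for w in weights:
        blended = [e + w * a for e, a in zip(r_env, r_aux)]
        policy = inner_loop(blended)
        ret = expected(policy, r_env)
        if ret > best_return:
            best_w, best_return = w, ret
    return best_w, best_return


r_env = [1.0, 0.0]      # primary reward: arm 0 is the intended behavior
helpful = [1.0, 0.0]    # heuristic that agrees with the primary reward
harmful = [0.0, 1.0]    # misspecified heuristic that favors the wrong arm
weights = [0.0, 0.5, 2.0]

w_helpful, _ = outer_loop(r_env, helpful, weights)
w_harmful, _ = outer_loop(r_env, harmful, weights)
```

In this toy setting the outer loop selects the largest weight for the helpful heuristic, because it speeds up learning within the fixed step budget, and weight 0 for the harmful one. Because the outer level scores policies only by primary reward, a misspecified heuristic is ignored rather than followed.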

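We introduce a new framework that uses a bi-level objective to combine auxiliary rewards with the environment's primary rewards. It offers a robust and principled way to integrate designer-specified heuristics, addresses key shortcomings of existing approaches, and consistently leads to high-performing solutions even when given misaligned or poorly specified auxiliary reward functions.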

Behavior Alignment via Reward Function Optimization

The emergence of large language models (LLMs) has sparked significant
interest in extending their remarkable language capabilities to speech.
However, modality alignment between speech and text remains an open
problem. Current solutions can be categorized into two strategies. One is a
cascaded approach where outputs (tokens or states) of a separately trained
speech recognition system are used as inputs for LLMs, which limits their
potential in modeling alignment between speech and text. The other is an
end-to-end approach that relies on speech instruction data, which is very
difficult to collect in large quantities. In this paper, we address these
issues and propose the BLSP approach that Bootstraps Language-Speech
Pre-training via behavior alignment of continuation writing. We achieve this by
learning a lightweight modality adapter between a frozen speech encoder and an
LLM, ensuring that the LLM exhibits the same generation behavior regardless of
the modality of input: a speech segment or its transcript. The training process
can be divided into two steps. The first step prompts an LLM to generate texts
with speech transcripts as prefixes, obtaining text continuations. In the
second step, these continuations are used as supervised signals to train the
modality adapter in an end-to-end manner. We demonstrate that this
straightforward process can extend the capabilities of LLMs to speech, enabling
speech recognition, speech translation, spoken language understanding, and
speech conversation, even in zero-shot cross-lingual scenarios.
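
The data flow of the two steps can be sketched as follows. This is a minimal illustration, not the BLSP implementation: `build_adapter_dataset`, `AdapterExample`, and `toy_llm` are hypothetical names, the LLM is replaced by a stand-in callable, and the adapter training itself (step two) is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class AdapterExample:
    audio_path: str   # speech segment fed to the frozen speech encoder and adapter
    target_text: str  # continuation the LLM wrote from the transcript prefix


def build_adapter_dataset(
    pairs: Iterable[Tuple[str, str]],
    continue_text: Callable[[str], str],
) -> List[AdapterExample]:
    """Step 1: prompt the LLM with each transcript as a prefix and keep the
    continuation as the supervision target for the matching speech segment."""
    examples = []
    for audio_path, transcript in pairs:
        continuation = continue_text(transcript)
        examples.append(AdapterExample(audio_path, continuation))
    return examples


def toy_llm(prefix: str) -> str:
    # Stand-in for a real text-continuation call (assumption for illustration).
    return f"[continuation of: {prefix}]"


data = build_adapter_dataset([("a.wav", "the weather today")], toy_llm)
```

In step two, each `AdapterExample` would serve as a supervised pair: the adapter is trained so that the LLM, given the speech input, produces the same continuation it produced from the transcript.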

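Through behavior alignment, we propose a lightweight language-speech pre-training approach that extends the capabilities of large language models (LLMs) to speech recognition, speech translation, spoken language understanding, and speech conversation, achieving modality alignment between speech and text.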