Adapting Large Language Models (LLMs) to agent tasks is critical in
developing language agents. Direct Preference Optimization (DPO) is a promising
technique for this adaptation: it alleviates compounding errors and offers a
means to directly optimize Reinforcement Learning (RL) objectives.
However, applying DPO to multi-turn tasks is challenging because the
partition function cannot be cancelled. Overcoming this obstacle involves
making the partition function independent of the current state and addressing
length disparities between preferred and dis-preferred trajectories. To this
end, we replace the policy constraint with a state-action occupancy measure
constraint in the RL objective and add length normalization to the
Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn
agent tasks, with theoretical explanations. Extensive experiments on three
multi-turn agent task datasets confirm the effectiveness and superiority of the
DMPO loss.
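
As a rough illustration of the length-normalization idea only, the sketch below computes a DPO-style Bradley-Terry loss in which each trajectory's per-turn log-ratios are averaged over its number of turns, so a longer trajectory does not dominate simply by having more terms. This is not the DMPO loss itself, whose exact form the abstract does not give. The function name, the plain averaging, and the `beta` default are all assumptions made for the example.

```python
import math


def length_normalized_preference_loss(chosen_logratios, rejected_logratios, beta=0.1):
    """Hypothetical sketch of a length-normalized Bradley-Terry loss.

    Each argument is a list of per-turn values log(pi_theta / pi_ref) for the
    actions in the preferred (chosen) or dis-preferred (rejected) trajectory.
    Averaging over turns is an assumed normalization, not the paper's formula.
    """
    chosen = sum(chosen_logratios) / len(chosen_logratios)
    rejected = sum(rejected_logratios) / len(rejected_logratios)
    margin = beta * (chosen - rejected)
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Trajectories of different lengths with identical per-turn log-ratios
# give a zero margin, so the loss equals log(2).
loss_equal = length_normalized_preference_loss([0.0, 0.0], [0.0, 0.0, 0.0])
```

With plain summation instead of averaging, the two trajectories would contribute terms in proportion to their lengths. Dividing by the turn count removes that dependence in this simplified setting.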

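Using the DMPO loss function to adapt large language models (LLMs) to multi-turn tasks optimizes the reinforcement learning (RL) objective and comes with theoretical explanations. Experiments confirm the effectiveness and superiority of the DMPO loss.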

Direct Multi-Turn Preference Optimization for Language Agents

Instruction-tuned Large Language Models (LLMs) have recently gained huge
popularity thanks to their ability to interact with users through conversation.
In this work, we aim to evaluate their ability to complete multi-turn tasks and
interact with external databases in the context of established task-oriented
dialogue benchmarks. We show that for explicit belief state tracking, LLMs
underperform compared to specialized task-specific models. Nevertheless, they
show the ability to guide the dialogue to a successful ending if given correct
slot values. Furthermore, this ability improves with access to the true belief
state distribution or in-domain examples.

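This study examines the ability of large language models to complete multi-turn tasks and interact with external databases. It finds that they underperform specialized task-specific models on explicit belief state tracking, but can guide the dialogue to a successful ending when given correct slot values, and that this ability improves with access to the true belief state distribution or in-domain examples.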