Recent advancements in Large Language Models (LLMs) have enhanced the
efficacy of agent communication and social interactions. Despite these
advancements, building LLM-based agents for reasoning in dynamic environments
involving competition and collaboration remains challenging due to the
limitations of informed graph-based search methods. We propose PLAYER*, a novel
framework based on an anytime sampling-based planner, which utilises sensors
and pruners to enable a purely question-driven searching framework for complex
reasoning tasks. We also introduce a quantifiable evaluation method using
multiple-choice questions and construct the WellPlay dataset with 1,482 QA
pairs. Experiments demonstrate that PLAYER* improves both efficiency and
performance over existing methods in complex, dynamic environments, with
quantifiable results.

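Recent advances in Large Language Models (LLMs) have improved agent communication and social interaction. Even so, building LLM-based agents that reason in dynamic environments involving competition and collaboration remains challenging because of the limitations of informed graph-based search methods. We propose PLAYER*, a novel framework built on an anytime sampling-based planner that uses sensors and pruners to provide a purely question-driven search framework for complex reasoning tasks. We also introduce a quantifiable evaluation method based on multiple-choice questions and construct the WellPlay dataset of 1,482 QA pairs. Experiments show that, compared with existing methods, PLAYER* improves efficiency and performance in complex, dynamic environments with quantifiable results.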

PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games

Training agents to communicate with one another given only task-based
supervision has attracted considerable attention recently, due to the growing interest
in developing models for human-agent interaction. Prior work on the topic
focused on simple environments, where training using policy gradient was
feasible despite the non-stationarity of the agents during training. In this
paper, we present a more challenging environment for testing the emergence of
communication from raw pixels, where training using policy gradient fails. We
propose a new model and training algorithm that utilizes the structure of a
learned representation space to produce more consistent speakers at the initial
phases of training, which stabilizes learning. We empirically show that our
algorithm substantially improves performance compared to policy gradient. We
also propose a new alignment-based metric for measuring context-independence in
emerged communication and find our method increases context-independence
compared to policy gradient and other competitive baselines.

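This paper proposes a new model and training algorithm that, in an environment with raw-pixel input, uses the structure of a learned representation space to produce more consistent speakers and stabilize learning. It also proposes a new alignment-based metric for measuring context-independence. The algorithm substantially improves performance over policy gradient and increases context-independence compared to policy gradient and other competitive baselines.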