The generalization of decision-making agents encompasses two fundamental
elements: learning from past experiences and reasoning in novel contexts.
However, the predominant emphasis in most interactive environments is on
learning, often at the expense of complexity in reasoning. In this paper, we
introduce CivRealm, an environment inspired by the Civilization game.
Civilization's profound alignment with human history and society necessitates
sophisticated learning, while its ever-changing situations demand strong
reasoning to generalize. In particular, CivRealm sets up an
imperfect-information general-sum game with a changing number of players; it
presents a plethora of complex features, challenging the agent to deal with
open-ended stochastic environments that require diplomacy and negotiation
skills. Within CivRealm, we provide interfaces for two typical agent types:
tensor-based agents that focus on learning, and language-based agents that
emphasize reasoning. To catalyze further research, we present initial results
for both paradigms. The canonical RL-based agents exhibit reasonable
performance in mini-games, whereas both RL- and LLM-based agents struggle to
make substantial progress in the full game. Overall, CivRealm stands as a
unique learning and reasoning challenge for decision-making agents. The code is
available at this https URL

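Summary: Using the CivRealm environment, this paper examines the two fundamental elements of decision-making agents, learning and reasoning, and the imbalance between them in existing interactive environments.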

CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents

Large Language Models (LLMs) are becoming increasingly smart and autonomous,
targeting real-world pragmatic missions beyond traditional NLP tasks. As a
result, there has been an urgent need to evaluate LLMs as agents on challenging
tasks in interactive environments. We present AgentBench, a multi-dimensional
evolving benchmark that currently consists of 8 distinct environments to assess
LLM-as-Agent's reasoning and decision-making abilities in a multi-turn
open-ended generation setting. Our extensive test over 25 LLMs (including APIs
and open-sourced models) shows that, while top commercial LLMs present a strong
ability of acting as agents in complex environments, there is a significant
disparity in performance between them and open-sourced competitors. It also
serves as a component of an ongoing project with wider coverage and deeper
consideration towards systematic LLM evaluation. Datasets, environments, and an
integrated evaluation package for AgentBench are released at
this https URL

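Summary: AgentBench evaluates the reasoning and decision-making abilities of LLMs acting as agents in interactive environments, in a multi-turn open-ended generation setting, and shows a performance gap between commercial LLMs and their open-source competitors.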

AgentBench: Evaluating LLMs as Agents

Can world knowledge learned by large language models (LLMs) be used to act in
interactive environments? In this paper, we investigate the possibility of
grounding high-level tasks, expressed in natural language (e.g. "make
breakfast"), to a chosen set of actionable steps (e.g. "open fridge"). While
prior work focused on learning from explicit step-by-step examples of how to
act, we surprisingly find that if pre-trained LMs are large enough and prompted
appropriately, they can effectively decompose high-level tasks into mid-level
plans without any further training. However, the plans produced naively by LLMs
often cannot map precisely to admissible actions. We propose a procedure that
conditions on existing demonstrations and semantically translates the plans to
admissible actions. Our evaluation in the recent VirtualHome environment shows
that the resulting method substantially improves executability over the LLM
baseline. The conducted human evaluation reveals a trade-off between
executability and correctness but shows a promising sign towards extracting
actionable knowledge from language models. Website at
this https URL
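
The translation step described above, mapping a free-form plan step to the closest admissible action, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the action list, the function names, and in particular the bag-of-words cosine similarity, which stands in for whatever semantic similarity measure the authors use. Conditioning on existing demonstrations is not modeled here.

```python
import math
from collections import Counter

# Hypothetical set of admissible environment actions (illustrative only).
ADMISSIBLE_ACTIONS = [
    "open fridge",
    "grab milk",
    "close fridge",
    "walk to kitchen",
    "switch on stove",
]


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def translate_step(free_form_step, admissible=ADMISSIBLE_ACTIONS):
    """Map one free-form plan step to the most similar admissible action."""
    query = embed(free_form_step)
    return max(admissible, key=lambda action: cosine(query, embed(action)))


def translate_plan(plan):
    """Translate every step of a free-form plan into admissible actions."""
    return [translate_step(step) for step in plan]


if __name__ == "__main__":
    llm_plan = ["open the fridge door", "grab the milk carton", "close the fridge"]
    print(translate_plan(llm_plan))
```

Choosing the nearest admissible action guarantees that every emitted step is executable, which is one plausible reading of the trade-off between executability and correctness that the abstract reports: the nearest admissible action is not always the one the plan intended.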

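Summary: This paper studies whether large language models can use their learned world knowledge to carry out high-level tasks in interactive environments. It proposes a procedure that conditions on existing demonstrations and semantically translates the mid-level plans generated by the language model into admissible actions. Evaluation in the VirtualHome environment shows that the method substantially improves executability over the LLM baseline.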

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Finding features that disentangle the different causes of variation in real
data is a difficult task that has nonetheless received considerable attention
in static domains like natural images. Interactive environments, in which an
agent can deliberately take actions, offer an opportunity to tackle this task
better, because the agent can experiment with different actions and observe
their effects. We introduce the idea that in interactive environments, latent
factors that control the variation in observed data can be identified by
figuring out what the agent can control. We propose a naive method to find
factors that explain or measure the effect of the actions of a learner, and
test it in illustrative experiments.
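
The core idea, identifying factors of variation by checking what the agent can control, can be illustrated with a toy sketch. This is not the method proposed in the paper. The environment, the feature names, and the scoring rule (variance across actions of each feature's mean change) are all assumptions chosen to keep the example small.

```python
from statistics import mean, pvariance

# Hypothetical toy world: the agent moves in (x, y); "noise" drifts on its own.
ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}
NOISE_STEPS = [2, -1, 3, -2]  # uncontrolled drift, identical for every action
FEATURES = ("x", "y", "noise")


def step(state, action, noise_delta):
    """Apply the agent's action to (x, y) and the external drift to noise."""
    x, y, noise = state
    dx, dy = ACTIONS[action]
    return (x + dx, y + dy, noise + noise_delta)


def controllability_scores():
    """Score each feature by how much its mean change depends on the action."""
    deltas = {action: [[] for _ in FEATURES] for action in ACTIONS}
    state = (0, 0, 0)
    for noise_delta in NOISE_STEPS:
        nxt = state
        # Try every action from the same state, under the same drift.
        for action in ACTIONS:
            nxt = step(state, action, noise_delta)
            for k in range(len(FEATURES)):
                deltas[action][k].append(nxt[k] - state[k])
        state = nxt  # continue from the outcome of the last action tried
    scores = {}
    for k, name in enumerate(FEATURES):
        per_action_mean = [mean(deltas[action][k]) for action in ACTIONS]
        # Between-action variance: zero when actions make no difference.
        scores[name] = pvariance(per_action_mean)
    return scores


if __name__ == "__main__":
    print(controllability_scores())
```

In this toy setting the two position features receive a positive score and the drifting feature receives a score of zero, because its change is the same whichever action is taken.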

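Summary: This paper studies how, in interactive environments, the latent factors that control variation in observed data can be identified by finding out what the learner can control. It proposes a naive method for finding such factors and tests it in illustrative experiments.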