Large language models have demonstrated remarkable few-shot performance on
many natural language understanding tasks. Despite several demonstrations of
using large language models in complex, strategic scenarios, there is no
comprehensive framework for evaluating agents' performance across various types
of reasoning found in games. To address this gap, we introduce GameBench, a
cross-domain benchmark for evaluating strategic reasoning abilities of LLM
agents. We focus on 9 different game environments, where each covers at least
one axis of key reasoning skill identified in strategy games, and select games
for which strategy explanations are unlikely to form a significant portion of
models' pretraining corpuses. Our evaluations use GPT-3 and GPT-4 in their base
form along with two scaffolding frameworks designed to enhance strategic
reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning
(RAP). Our results show that none of the tested models match human performance,
and at worst GPT-4 performs worse than random action. CoT and RAP both improve
scores, but not to levels comparable with human performance.

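GameBench, a cross-domain benchmark for evaluating the strategic reasoning abilities of large language models in games, shows that the tested models fall short of human performance, although two scaffolding frameworks for strategic reasoning (CoT and RAP) improve scores.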

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

We evaluate whether LLMs learn to make human-like preference judgements in
strategic scenarios as compared with known empirical results. We show that
Solar and Mistral exhibit stable value-based preferences consistent with those
of humans in the prisoner's dilemma, including the stake-size effect, and in the
traveler's dilemma, including the penalty-size effect. We establish a
relationship between model size, value-based preference, and superficiality.
Finally, we find that models that
tend to be less brittle were trained with sliding window attention.
Additionally, we contribute a novel method for constructing preference
relations from arbitrary LLMs and support for a hypothesis regarding human
behavior in the traveler's dilemma.

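We evaluate whether LLMs learn to make human-like preference judgements in strategic scenarios. The results show that Solar and Mistral exhibit stable value-based preferences consistent with those of humans, including the stake-size effect in the prisoner's dilemma and the penalty-size effect in the traveler's dilemma. We find a relationship between model size, value-based preference, and superficiality, and we find that models trained with sliding window attention are more robust. In addition, we propose a new method for constructing preference relations from arbitrary LLMs and provide support for a hypothesis about human behavior in the traveler's dilemma.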