Large language models have demonstrated remarkable few-shot performance on
many natural language understanding tasks. Despite several demonstrations of
using large language models in complex, strategic scenarios, there is no
comprehensive framework for evaluating agents' performance across various types
of reasoning found in games. To address this gap, we introduce GameBench, a
cross-domain benchmark for evaluating strategic reasoning abilities of LLM
agents. We focus on 9 different game environments, where each covers at least
one axis of key reasoning skill identified in strategy games, and select games
for which strategy explanations are unlikely to form a significant portion of
models' pretraining corpora. Our evaluations use GPT-3 and GPT-4 in their base
form along with two scaffolding frameworks designed to enhance strategic
reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning
(RAP). Our results show that none of the tested models match human performance,
and at worst GPT-4 performs worse than random action. CoT and RAP both improve
scores, but not to levels comparable to human performance.

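GameBench, a cross-domain benchmark for evaluating the strategic reasoning abilities of large language models in games, shows that the tested models fall short of human performance, although two scaffolding frameworks for strategic reasoning (CoT and RAP) improve scores.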

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

Decision-making problems, categorized as single-agent, e.g., Atari,
cooperative multi-agent, e.g., Hanabi, competitive multi-agent, e.g., Hold'em
poker, and mixed cooperative and competitive, e.g., football, are ubiquitous in
the real world. Various methods have been proposed to address specific
decision-making problems. Despite successes in individual categories, these
methods typically evolve independently and cannot generalize to other
categories. Therefore, a fundamental question for decision-making is: \emph{Can
we develop \textbf{a single algorithm} to tackle \textbf{ALL} categories of
decision-making problems?} Answering this question poses several main
challenges: i) different decision-making categories involve different numbers of
agents and different relationships between agents, ii) different categories
have different solution concepts and evaluation measures, and iii) there is no
a comprehensive benchmark covering all the categories. This work presents a
preliminary attempt to address the question with three main contributions. i)
We propose the generalized mirror descent (GMD), a generalization of MD
variants, which considers multiple historical policies and works with a broader
class of Bregman divergences. ii) We propose the configurable mirror descent
(CMD) where a meta-controller is introduced to dynamically adjust the
hyper-parameters in GMD conditional on the evaluation measures. iii) We
construct \textsc{GameBench}, a benchmark of 15 academic-friendly games across
different decision-making categories. Extensive experiments demonstrate that
CMD achieves empirically competitive or better outcomes compared to baselines
while providing the capability of exploring diverse dimensions of decision
making.
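
For readers unfamiliar with the base method that GMD generalizes, the following
is a minimal sketch of standard mirror descent on the probability simplex with
the negative-entropy mirror map, where the Bregman divergence is the KL
divergence. It is not the GMD or CMD algorithm from the abstract: it uses only
the current policy and one fixed divergence, and the function names, the linear
loss, and the step size are assumptions chosen purely for illustration.

```python
import math


def mirror_descent_step(policy, grad, lr):
    """One entropic mirror descent step on the probability simplex.

    With the negative-entropy mirror map the update has the closed form
    new_policy[i] proportional to policy[i] * exp(-lr * grad[i]).
    """
    weights = [p * math.exp(-lr * g) for p, g in zip(policy, grad)]
    total = sum(weights)
    return [w / total for w in weights]


def minimize_linear_loss(loss, steps=200, lr=0.5):
    """Minimize <policy, loss> over the simplex (hypothetical toy objective)."""
    policy = [1.0 / len(loss)] * len(loss)  # start from the uniform policy
    for _ in range(steps):
        # The gradient of the linear objective <policy, loss> is the loss vector.
        policy = mirror_descent_step(policy, loss, lr)
    return policy


final_policy = minimize_linear_loss([1.0, 0.2, 0.7])
best_action = max(range(len(final_policy)), key=final_policy.__getitem__)
```

In this toy setting the policy concentrates on the action with the lowest
loss. According to the abstract, GMD extends this kind of update to multiple
historical policies and a broader class of Bregman divergences, neither of
which the sketch attempts to reproduce.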

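This paper explores whether a single algorithm can be developed to tackle all categories of decision-making problems. It addresses the challenges posed by the different categories by introducing generalized mirror descent (GMD), configurable mirror descent (CMD), and the GameBench benchmark, and extensive experiments show that CMD performs competitively across the dimensions of decision making.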