Recent work has established that Large Language Models (LLMs) can be prompted
to "self-play" conversational games that probe certain capabilities (general
instruction following, strategic goal orientation, language understanding),
where the resulting interactive game play can be automatically scored. In this
paper, we take one of the proposed frameworks for
setting up such game-play environments, and further test its usefulness as an
evaluation instrument, along a number of dimensions: We show that it can easily
keep up with new developments while avoiding data contamination, we show that
the tests implemented within it are not yet saturated (human performance is
substantially higher than that of even the best models), and we show that it
lends itself to investigating additional questions, such as the impact of the
prompting language on performance. We believe that the approach forms a good
basis for choosing models when building applied interactive systems, and
perhaps ultimately for setting up a closed-loop development environment of
system and simulated evaluator.

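Research on having large language models self-play conversational games aims to explore the generality of the approach, evaluate model performance, and study the impact of the prompting language on that performance. This research provides a basis for model selection when building applied interactive systems, and perhaps ultimately for establishing a closed-loop development environment of model and simulated evaluator.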