Large Language Models (LLMs) are increasingly used in real-world settings, yet their strategic abilities remain largely unexplored. Game theory provides a good framework for assessing the decision-making abilities of LLMs in interactions with other agents. Although prior studies have shown that LLMs can solve these tasks with carefully curated prompts, they fail when the problem setting or prompt changes. In this work, we investigate LLMs' behaviour in two strategic games, Stag Hunt and Prisoner's Dilemma, analyzing performance variations under different settings and prompts. Our results show that the tested state-of-the-art LLMs exhibit at least one of the following systematic biases: (1) positional bias, (2) payoff bias, or (3) behavioural bias. We further observe that the LLMs' performance drops when the game configuration is misaligned with the biases affecting the model. Performance is assessed based on the selection of the correct action, that is, the one that agrees with the prompted preferred behaviours of both players. Alignment refers to whether the LLM's bias aligns with the correct action. For example, GPT-4o's average performance drops by 34% when misaligned. Additionally, the current trend of "bigger and newer is better" does not hold here: GPT-4o (the current best-performing LLM) suffers the most substantial performance drop. Lastly, we note that while chain-of-thought prompting does reduce the effect of the biases on most models, it is far from solving the problem at the fundamental level.
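
For readers less familiar with the two games named in the abstract, the following minimal sketch enumerates their pure-strategy Nash equilibria. The payoff values are common textbook choices assumed here for illustration and are not taken from the paper; the function name `pure_nash_equilibria` is likewise hypothetical.

```python
from itertools import product

# Illustrative payoff matrices (assumed values, NOT taken from the paper).
# Each entry maps (row action, column action) -> (row payoff, column payoff).
STAG_HUNT = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}
PRISONERS_DILEMMA = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}


def pure_nash_equilibria(game):
    """Return all action pairs where neither player gains by deviating alone."""
    rows = sorted({r for r, _ in game})
    cols = sorted({c for _, c in game})
    equilibria = []
    for r, c in product(rows, cols):
        row_payoff, col_payoff = game[(r, c)]
        # Row player cannot do better by switching rows against column c.
        row_best = all(game[(alt, c)][0] <= row_payoff for alt in rows)
        # Column player cannot do better by switching columns against row r.
        col_best = all(game[(r, alt)][1] <= col_payoff for alt in cols)
        if row_best and col_best:
            equilibria.append((r, c))
    return equilibria


if __name__ == "__main__":
    print("Stag Hunt:", pure_nash_equilibria(STAG_HUNT))
    print("Prisoner's Dilemma:", pure_nash_equilibria(PRISONERS_DILEMMA))
```

Under these assumed payoffs, Stag Hunt has two pure equilibria (both hunt stag, or both hunt hare), while Prisoner's Dilemma has only mutual defection. Both games are non-zero-sum, since the players' payoffs do not cancel out.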

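Prior work indicates that, although Large Language Models (LLMs) can solve specific tasks with carefully curated prompts, they lean toward different strategies when the problem setting or prompt changes, which degrades their performance. We therefore study the behaviour of LLMs in strategic games, analyzing performance variations under different settings and prompts, and find that they exhibit at least one systematic bias: (1) positional bias, (2) payoff bias, or (3) behavioural bias. We also observe that whether an LLM's bias aligns with the correct action affects its performance. However, the current trend of pursuing "bigger and newer" does not apply here: the current best-performing LLM (GPT-4o) shows the most pronounced performance drop. Finally, we note that although chain-of-thought prompting does reduce the effect of the biases on most models, solving the problem at a fundamental level remains difficult.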

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games