This paper provides a theoretical framework that validates and explains earlier experimental results, obtained in joint work with Bei Zhou, showing that AlphaZero-style reinforcement learning algorithms struggle to learn optimal play in NIM, a canonical impartial game proposed as an AI challenge by Harvey Friedman in 2017. Our analysis resolves a controversy around these experimental results, which revealed unexpected difficulties in learning NIM despite its mathematical simplicity compared to games like chess and Go. Our key contributions are as follows: (1) We prove that, by incorporating recent game history, these limited AlphaZero models can in principle achieve optimal play in NIM. (2) We introduce a novel search strategy in which roll-outs preserve game-theoretic values during move selection, guided by a specialised policy network. (3) We provide constructive proofs showing that our approach enables optimal play within the \(\text{AC}^0\) complexity class, despite the theoretical limitations of these networks. This research demonstrates how constrained neural networks, when properly designed, can achieve sophisticated decision-making even in domains where their basic computational capabilities appear insufficient.
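For readers unfamiliar with NIM, the following minimal sketch illustrates what "optimal play" means here. It is not the method of this paper: it is the classical nim-sum rule (Bouton's theorem), and it assumes the normal-play convention, in which the player who takes the last object wins; the abstract does not state which convention is used. The function names `nim_sum` and `optimal_move` are hypothetical and chosen only for illustration. The nim-sum is a bitwise parity taken across all heaps, and parity is known to lie outside \(\text{AC}^0\), which gives some intuition for why a network restricted to that class cannot compute this rule directly from a single position.

```python
from functools import reduce
from operator import xor


def nim_sum(heaps):
    """Bitwise XOR of all heap sizes (the classical nim-sum)."""
    return reduce(xor, heaps, 0)


def optimal_move(heaps):
    """Return (heap_index, new_size) for a winning move under normal play.

    Returns None when the nim-sum is already zero, i.e. when every
    available move leaves the opponent in a winning position.
    Assumption: normal-play convention (taking the last object wins).
    """
    s = nim_sum(heaps)
    if s == 0:
        return None
    for i, h in enumerate(heaps):
        target = h ^ s
        # A legal move must strictly shrink the heap; shrinking heap i
        # to h ^ s makes the nim-sum of the new position zero.
        if target < h:
            return i, target
    return None  # not reached when s != 0


if __name__ == "__main__":
    print(nim_sum([3, 4, 5]))
    print(optimal_move([3, 4, 5]))
    print(optimal_move([1, 2, 3]))
```

In this sketch, a position is lost for the player to move exactly when the nim-sum is zero; otherwise a move that restores a zero nim-sum always exists.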

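This study addresses the difficulty that AlphaZero-style reinforcement learning algorithms have in learning optimal strategies in NIM, an impartial game. We propose that, by taking the game's history into account, these limited AlphaZero models can in theory achieve optimal play in NIM. The results show that suitably designed constrained neural networks can achieve sophisticated decision-making in domains where their basic computational capabilities appear insufficient.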

Mastering NIM and Impartial Games with Weak Neural Networks: An AlphaZero-Like Multi-Frame Approach