In this work, we adapt a training approach inspired by the original AlphaGo system to play the imperfect information game of Reconnaissance Blind Chess (RBC). Using only the observations instead of a full description of the game state, we first train a supervised agent on publicly available game records. Next, we increase the performance of the agent through self-play with the on-policy reinforcement learning algorithm Proximal Policy Optimization (PPO). To avoid problems caused by the partial observability of game states, we do not use any search and rely only on the policy network to generate moves when playing. With this approach, we achieve an Elo rating of 1330 on the RBC leaderboard, which places our agent at position 27 at the time of this writing. We see that self-play significantly improves performance and that the agent plays acceptably well without search and without making assumptions about the true game state.
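
As a rough illustration of the reinforcement learning step named above, the following is a minimal sketch of the standard PPO clipped surrogate loss. It is not the authors' implementation: the function name `ppo_clipped_loss`, the plain-list inputs, and the clipping value `clip_eps=0.2` are assumptions chosen for illustration, and the abstract does not state the hyperparameters actually used.

```python
import math


def ppo_clipped_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean PPO clipped-surrogate loss (a value to minimize) over a batch.

    new_logps:  log-probabilities of the taken moves under the current policy
    old_logps:  log-probabilities of the same moves under the policy that
                generated the self-play data
    advantages: advantage estimates for those moves
    clip_eps:   clipping range (hypothetical default, not from the paper)
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - eps, 1 + eps] to limit the policy update.
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (lower) bound of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    # Negate because optimizers minimize, while the surrogate is maximized.
    return -total / len(advantages)
```

When the current and data-collecting policies agree, the ratio is 1 and the loss reduces to the negative mean advantage. Clipping only takes effect once the policy has moved away from the one that produced the games.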

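This study applies an AlphaGo-inspired training approach to the imperfect-information game Reconnaissance Blind Chess. A supervised agent is improved through self-play with the PPO reinforcement learning algorithm, reaching an Elo rating of 1330 on the RBC leaderboard (position 27). The results show that self-play significantly improves performance and that the agent plays reasonably well without search and without assumptions about the true game state.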

Supervised and Reinforcement Learning from Observations in Reconnaissance Blind Chess