In recent years, Deep Reinforcement Learning (DRL) algorithms have achieved
state-of-the-art performance in many challenging strategy games. Because these
games have complicated rules, an action sampled from the full discrete action
distribution predicted by the learned policy is likely to be invalid according
to the game rules (e.g., walking into a wall). The usual approach to dealing with
this problem in policy gradient algorithms is to "mask out" invalid actions and
sample only from the set of valid actions. The implications of this process,
however, remain under-investigated. In this paper, we 1) provide theoretical
justification for such a practice, 2) empirically demonstrate its importance as
the space of invalid actions grows, and 3) provide further insights by
evaluating different action masking regimes, such as removing masking after an
agent has been trained using masking. The source code can be found at
this https URL

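This paper studies how to handle the invalid actions produced by a learned policy when deep reinforcement learning algorithms are applied to games with complex rules. It provides theoretical justification for masking, empirically demonstrates its effectiveness, and evaluates different action masking schemes.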
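
To make the "mask out" step concrete, the following is a minimal sketch of one common way to realize it. It assumes that the logit of every invalid action is replaced by negative infinity before the softmax, so that those actions receive probability zero and sampling is restricted to the valid set. The abstract does not specify the implementation, so this may differ from the authors' code; the function names `masked_probs` and `sample_valid_action` and the example values are hypothetical.

```python
import math
import random


def masked_probs(logits, valid_mask):
    """Softmax over logits after setting invalid actions' logits to -inf.

    Assumption: all logits are finite and valid_mask[i] is True when
    action i is allowed by the game rules in the current state.
    """
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid_mask)]
    m = max(masked)  # finite, because at least one action is valid
    exps = [math.exp(l - m) for l in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def sample_valid_action(logits, valid_mask, rng=random):
    """Sample an action index from the masked distribution."""
    probs = masked_probs(logits, valid_mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical example: four actions, of which actions 1 and 3 are invalid.
logits = [1.0, 2.0, 0.5, 3.0]
valid_mask = [True, False, True, False]
probs = masked_probs(logits, valid_mask)
action = sample_valid_action(logits, valid_mask, random.Random(0))
```

In this sketch the invalid actions get exactly zero probability, so the sampled action is always one of the valid indices (0 or 2 here), regardless of how large the invalid actions' raw logits are.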