Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios. Unfortunately, this evaluation approach provides limited evidence for post-deployment generalization beyond the test distribution. In this paper, we address this limitation by extending the recent CheckList testing methodology from natural language processing to planning-based RL. Specifically, we consider testing RL agents that make decisions via online tree search using a learned transition model and value function. The key idea is to improve the assessment of future performance via a CheckList approach for exploring and assessing the agent's inferences during tree search. The approach provides the user with an interface and general query-rule mechanism for identifying potential inference flaws and validating expected inference invariances. We present a user study involving knowledgeable AI researchers using the approach to evaluate an agent trained to play a complex real-time strategy game. The results show that the approach is effective in allowing users to identify previously unknown flaws in the agent's reasoning. In addition, our analysis provides insight into how AI experts use this type of testing approach, which may help improve future instantiations.
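
To make the query-rule idea concrete, the following is a minimal sketch, not the paper's actual interface. It assumes a search tree whose nodes hold a predicted state and a value estimate; a query selects nodes of interest, and a rule states what should hold for them. The names `Node`, `walk`, `check`, and the `enemy_hp` feature are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


@dataclass
class Node:
    """One search-tree node: a predicted state plus the agent's value estimate."""
    state: Dict[str, float]  # hypothetical feature dictionary for the predicted state
    value: float             # estimate from the learned value function
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield every node in the tree, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def check(
    root: Node,
    query: Callable[[Node], bool],
    rule: Callable[[Node], bool],
) -> List[Node]:
    """Return the nodes selected by `query` that violate `rule`."""
    return [n for n in walk(root) if query(n) and not rule(n)]


# Toy tree (made-up numbers): two children predict a defeated enemy.
root = Node({"enemy_hp": 10.0}, 0.2, [
    Node({"enemy_hp": 0.0}, 0.9),
    Node({"enemy_hp": 0.0}, -0.4),  # value looks inconsistent with the state
    Node({"enemy_hp": 5.0}, 0.1),
])

# Query: states where the enemy is defeated.
# Rule: the value estimate for such states should be positive.
violations = check(
    root,
    query=lambda n: n.state["enemy_hp"] == 0.0,
    rule=lambda n: n.value > 0.0,
)
print(len(violations))
```

Separating the query from the rule lets the same rule be reused across different subsets of the tree, which is one plausible way to express an expected invariance; the approach described in the abstract may structure this differently.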

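This paper describes how to use the CheckList methodology to test reinforcement learning agents that act via online tree search, in order to better assess their future performance and help developers find flaws in the agent's inferences. The approach is realized through a user interface and a general query-rule mechanism. The results show that the approach effectively helps users discover previously unknown flaws in the agent's reasoning, and that it may also help improve future applications and related development.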

Beyond Value: A CheckList for Testing Inferences in Planning-Based Reinforcement Learning