The black-box nature of deep reinforcement learning (RL) hinders its use in real-world applications. Interpreting and explaining RL agents has therefore been an active research topic in recent years. Existing methods for post-hoc explanation usually adopt the action matching principle to make vision-based RL agents easy to understand. In this paper, we argue that the commonly used action matching principle explains the deep neural network (DNN) more than it interprets the RL agent. It may lead to irrelevant or misplaced feature attribution when different DNN outputs lead to the same rewards, or when the same outputs result in different rewards. We therefore propose to treat rewards, the essential objective of RL agents, as the essential objective of interpreting RL agents as well. To ensure reward consistency during interpretable feature discovery, we propose a novel framework (RL interpreting RL, denoted RL-in-RL) that addresses the gradient disconnection from actions to rewards. We verify and evaluate our method on Atari 2600 games as well as Duckietown, a challenging self-driving car simulator environment. The results show that our method maintains reward (or return) consistency and achieves high-quality feature attribution. Further, a series of analytical experiments validate our assumption about the limitations of the action matching principle.

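By proposing reward consistency and feature attribution as central objectives for understanding reinforcement learning (RL) agents, and by introducing a novel framework (RL interpreting RL, abbreviated RL-in-RL) to address the gradient disconnection from actions to rewards, this paper verifies and evaluates the proposed method on Atari 2600 games and Duckietown. The results show that the method maintains reward consistency and achieves high-quality feature attribution, and a series of analytical experiments confirm the assumption about the limitations of the action matching principle.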

Leveraging Reward Consistency for Interpretable Feature Discovery in Reinforcement Learning