The agent learns to organize decision behavior to achieve a behavioral goal, such as reward maximization, and reinforcement learning is often used for this optimization. Learning an optimal behavioral strategy is difficult under the uncertainty that events necessary for learning are only partially observable, called as Partially Observable Markov Decision Process (POMDP). However, the real-world environment also gives many events irrelevant to reward delivery and an optimal behavioral strategy. The conventional methods in POMDP, which attempt to infer transition rules among the entire observations, including irrelevant states, are ineffective in such an environment. Supposing Redundantly Observable Markov Decision Process (ROMDP), here we propose a method for goal-oriented reinforcement learning to efficiently learn state transition rules among reward-related "core states'' from redundant observations. Starting with a small number of initial core states, our model gradually adds new core states to the transition diagram until it achieves an optimal behavioral strategy consistent with the Bellman equation. We demonstrate that the resultant inference model outperforms the conventional method for POMDP. We emphasize that our model only containing the core states has high explainability. Furthermore, the proposed method suits online learning as it suppresses memory consumption and improves learning speed.

通过观察其余状态以有效学习核心状态之间的状态转移规则，针对部分可观测马尔科夫决策过程(POMDP)提出一种面向目标的强化学习方法。 在逐步添加新的核心状态到转换图中的同时，本模型仅包含核心状态，它监督一小部分核心状态以了解动态环境并获得最佳行为策略，这使其具有良好的可解释性。 此外，该方法适用于在线学习，可以抑制内存消耗并提高学习速度。

基于目标的冗余观测环境推断