Real-world decision-making problems are often partially observable, and many
can be formulated as a Partially Observable Markov Decision Process (POMDP).
When reinforcement learning (RL) algorithms are applied to a POMDP, a reasonable
estimate of the hidden states can help solve the problem. Furthermore,
explainable decision-making is preferable, given the application of RL to
real-world tasks such as autonomous driving. We propose an RL algorithm
that estimates the hidden states through end-to-end training and visualizes the
estimates as a state-transition graph. Experimental results demonstrate that
the proposed algorithm can solve simple POMDP problems and that the
visualization makes the agent's behavior interpretable to humans.
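
As a rough illustration of what a state-transition graph over estimated hidden states might look like, the sketch below counts transitions between consecutive estimates along one episode. The abstract does not say how the hidden states are represented; discrete state indices, the function name `transition_graph`, and the example sequence are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter

def transition_graph(state_sequence):
    """Count transitions between consecutive (assumed discrete) state estimates.

    Returns a dict mapping (source, destination) pairs to transition counts,
    which can be read as a weighted, directed state-transition graph.
    """
    edges = Counter(zip(state_sequence, state_sequence[1:]))
    return dict(edges)

# Hypothetical sequence of estimated hidden-state indices from one episode.
estimates = [0, 0, 1, 2, 1, 2, 2, 0]
graph = transition_graph(estimates)

# Print each edge with its count, e.g. as input to a graph-drawing tool.
for (src, dst), count in sorted(graph.items()):
    print(f"{src} -> {dst}: {count}")
```

Edge counts like these could then be passed to any graph-drawing tool so that a human can inspect which estimated states the agent moves between.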

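An RL algorithm that estimates the hidden states through end-to-end training and visualizes the estimates as a state-transition graph. Experimental results show that the algorithm can solve simple POMDP problems and makes the agent's behavior interpretable to humans.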

End-to-End Policy Gradient Method for POMDPs and Explainable Agents

The Laplacian representation has recently gained increasing attention in
reinforcement learning because it provides a succinct and informative representation
of states, taking the eigenvectors of the Laplacian matrix of the
state-transition graph as state embeddings. Such a representation captures the
geometry of the underlying state space and benefits RL tasks such as
option discovery and reward shaping. To approximate the Laplacian
representation in large (or even continuous) state spaces, recent works propose
minimizing a spectral graph drawing objective, which, however, has infinitely
many global minimizers other than the eigenvectors. As a result, the learned
Laplacian representation may differ from the ground truth. To solve this
problem, we reformulate the graph drawing objective into a generalized form and
derive a new learning objective, which is proven to have the eigenvectors as its
unique global minimizer. This enables learning high-quality Laplacian
representations that faithfully approximate the ground truth. We validate this
via comprehensive experiments on a set of gridworld and continuous control
environments. Moreover, we show that our learned Laplacian representations lead
to more exploratory options and better reward shaping.
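
The non-uniqueness mentioned above can be checked numerically on a small graph. The sketch below assumes the standard form of the spectral graph drawing objective, trace(U^T L U) minimized over matrices U with orthonormal columns, and uses a hypothetical 8-state chain as the state-transition graph. It does not implement the generalized objective proposed in the paper. The helper names `chain_laplacian` and `graph_drawing_objective` are illustrative.

```python
import numpy as np

def chain_laplacian(n):
    """Graph Laplacian L = D - A of an n-state chain (path graph)."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

def graph_drawing_objective(L, U):
    """Assumed standard objective: trace(U^T L U), U with orthonormal columns."""
    return float(np.trace(U.T @ L @ U))

n, d = 8, 3
L = chain_laplacian(n)

# eigh returns eigenvalues in ascending order; the d smallest eigenvectors
# form the d-dimensional Laplacian representation (one row per state).
eigvals, eigvecs = np.linalg.eigh(L)
U = eigvecs[:, :d]

# Rotate the embedding by a random orthogonal matrix Q (from a QR factorization).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
U_rot = U @ Q

obj_eig = graph_drawing_objective(L, U)
obj_rot = graph_drawing_objective(L, U_rot)

print("objective at eigenvectors:      ", obj_eig)
print("objective at rotated embedding: ", obj_rot)
print("columns still orthonormal:      ", np.allclose(U_rot.T @ U_rot, np.eye(d)))
```

Both embeddings satisfy the constraint and reach the same objective value, yet the columns of `U_rot` are mixtures of eigenvectors rather than the eigenvectors themselves. Any orthogonal `Q` works, which is why this form of the objective cannot single out the eigenvectors.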

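This work studies encoding states with the Laplacian matrix and proposes a new learning method that provides high-quality Laplacian representations for reinforcement learning tasks with large state spaces, leading to better reward shaping and more exploratory options.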