In partially observable (PO) environments, deep reinforcement learning (RL)
agents often perform poorly because two problems must be tackled together:
how to extract information from the raw observations to
solve the task, and how to improve the policy. In thi