In real-world reinforcement learning (RL) systems, various forms of impaired observability can complicate learning and control. These situations arise when latency or lossy channels prevent an agent from observing the most recent state of the system, yet the agent must still make real-time decisions. This paper presents a theoretical study of efficient RL in control systems where agents must act with delayed and missing state observations. We establish near-optimal regret bounds, of the form $\tilde{\mathcal{O}}(\sqrt{{\rm poly}(H) SAK})$, for RL in both the delayed and missing observation settings. Although impaired observability poses significant challenges to the policy class and to planning, our results demonstrate that learning remains efficient, with the regret bound depending optimally on the state-action size of the original system. Additionally, we characterize the performance of the optimal policy under impaired observability and compare it to the optimal value obtained with full observability.
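To make the two impairments concrete, the following is a minimal Python sketch of an observation channel that delivers states late and sometimes drops them. It is an illustration only: the class name `ImpairedObservationChannel` and the parameters `delay` and `drop_prob` are hypothetical, and the fixed delay and independent drops are simplifying assumptions that need not match the formal models analyzed in the paper.

```python
import random
from collections import deque


class ImpairedObservationChannel:
    """Hypothetical channel: delivers each true state `delay` steps late,
    and independently drops each delivery with probability `drop_prob`."""

    def __init__(self, delay, drop_prob, seed=0):
        if delay < 0:
            raise ValueError("delay must be non-negative")
        self.delay = delay
        self.drop_prob = drop_prob
        self.rng = random.Random(seed)
        self.buffer = deque()  # true states still in transit

    def observe(self, true_state):
        """Record the current true state and return what the agent sees now:
        the state from `delay` steps ago, or None if nothing arrives."""
        self.buffer.append(true_state)
        if len(self.buffer) <= self.delay:
            return None  # the first `delay` steps yield no observation
        delayed_state = self.buffer.popleft()
        if self.rng.random() < self.drop_prob:
            return None  # observation lost in the channel
        return delayed_state


# Usage: with a delay of 2 and no drops, the agent sees states two steps late.
channel = ImpairedObservationChannel(delay=2, drop_prob=0.0)
print([channel.observe(s) for s in range(5)])  # [None, None, 0, 1, 2]
```

In this toy setup the agent must choose each action from stale or absent observations, which is the difficulty the abstract refers to when it says impaired observability complicates the policy class and planning.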

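This paper studies how to perform reinforcement learning efficiently in control systems where delayed and missing observations prevent the agent from seeing the latest system state in real time. By establishing new near-optimal regret bounds, it shows that learning remains efficient with respect to the state-action size, and it compares the resulting performance with the optimal policy under full observability.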

Efficient Reinforcement Learning with Impaired Observability: Learning to Act with Delayed and Missing State Observations