Action delays degrade the performance of reinforcement learning in many real-world systems. This paper proposes a formal definition of the delay-aware Markov decision process and proves that it can be transformed into a standard MDP with augmented states using the Markov reward process. We develop a delay-aware model-based reinforcement learning framework that incorporates the multi-step delay into the learned system models without requiring the delay itself to be learned. Experiments on the Gym and MuJoCo platforms show that, compared with off-policy model-free reinforcement learning methods, the proposed delay-aware model-based algorithm is more efficient in training and more transferable between systems with different delay durations. Code is available at: https://github.com/baimingc/dambrl.
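The state augmentation mentioned above can be illustrated with a minimal sketch. It assumes a fixed, known delay of `delay` steps and uses a toy one-dimensional system; the names `ToyIntegrator` and `DelayedEnv` are hypothetical and are not taken from the authors' repository. The augmented state pairs the current observation with the queue of actions that have been chosen but not yet applied, which is what makes the delayed process Markov again.

```python
from collections import deque


class ToyIntegrator:
    """Hypothetical 1-D system: the state is the running sum of applied actions."""

    def __init__(self):
        self.x = 0.0

    def reset(self):
        self.x = 0.0
        return self.x

    def step(self, action):
        self.x += action
        reward = -abs(self.x)  # illustrative reward: stay close to zero
        return self.x, reward


class DelayedEnv:
    """Applies each chosen action `delay` steps after it is selected.

    The returned state is augmented: (observation, pending actions).
    This is an illustrative assumption, not the authors' implementation.
    """

    def __init__(self, env, delay, default_action=0.0):
        self.env = env
        self.delay = delay
        self.default_action = default_action
        self.pending = deque()

    def reset(self):
        x = self.env.reset()
        # Before any action is chosen, the queue holds default actions.
        self.pending = deque([self.default_action] * self.delay)
        return (x, tuple(self.pending))

    def step(self, action):
        # The newest action joins the back of the queue; the oldest is applied.
        self.pending.append(action)
        applied = self.pending.popleft()
        x, reward = self.env.step(applied)
        return (x, tuple(self.pending)), reward


env = DelayedEnv(ToyIntegrator(), delay=2)
state = env.reset()
trajectory = [state]
for a in (1.0, 1.0, 0.0):
    state, reward = env.step(a)
    trajectory.append(state)
```

With a delay of two steps, the first action of `1.0` only changes the underlying state on the third call to `step`; until then it is visible to the agent solely through the pending-action part of the augmented state.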

Delay-Aware Model-Based Reinforcement Learning for Continuous Control