This paper proposes a policy learning algorithm based on Koopman operator theory and the policy gradient approach. The algorithm simultaneously approximates an unknown dynamical system and searches for an optimal policy, using observations gathered through interaction with the environment. It has two innovations: first, it introduces the so-called deep Koopman representation into the policy gradient to obtain a linear approximation of the unknown dynamical system, with the aim of improving data efficiency; second, by applying Bellman's principle of optimality, it avoids the accumulated errors that approximating the system dynamics induces in long-term tasks. Furthermore, a theoretical analysis proves the asymptotic convergence of the proposed algorithm and characterizes the corresponding sampling complexity. These conclusions are also supported by simulations on several challenging benchmark environments.

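This paper proposes a policy learning algorithm based on Koopman operator theory and the policy gradient method. The algorithm combines linear approximation of an unknown dynamical system with the search for an optimal policy, introduces the so-called deep Koopman representation to improve data efficiency, and applies Bellman's principle of optimality to avoid the accumulated errors in long-term tasks caused by approximating the system dynamics. A theoretical analysis is also provided to establish the asymptotic convergence and the sampling complexity of the proposed algorithm.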

Policy Learning Based on Deep Koopman Representation