A novel Policy Gradient (PG) algorithm, called Matryoshka Policy Gradient (MPG), is introduced and studied, in the context of max-entropy reinforcement learning, where an agent aims at maximising entropy bonuses additional to its cumulative rewards. MPG differs from standard PG in that it trains a sequence of policies to learn finite horizon tasks simultaneously, instead of a single policy for the single standard objective. For softmax policies, we prove convergence of MPG and global optimality of the limit by showing that the only critical point of the MPG objective is the optimal policy; these results hold true even in the case of continuous compact state space. MPG is intuitive, theoretically sound and we furthermore show that the optimal policy of the standard max-entropy objective can be approximated arbitrarily well by the optimal policy of the MPG framework. Finally, we justify that MPG is well suited when the policies are parametrized with neural networks and we provide an simple criterion to verify the global optimality of the policy at convergence. As a proof of concept, we evaluate numerically MPG on standard test benchmarks.

本研究介绍并研究了一种名为 Matryoshka Policy Gradient（MPG）的新型策略梯度算法，可在最大熵强化学习环境下学习相互关联的有限任务，证明了其收敛性和全局最优性，适用于神经网络参数化的策略优化，以及在标准测试基准上的数值评估。

熵正则化强化学习的套娃政策梯度：收敛和全局最优性