Policy Mirror Descent (PMD) is a popular framework in reinforcement learning, serving as a unifying perspective that encompasses numerous algorithms. These algorithms are derived through the selection of a mirror map and enjoy finite-time convergence guarantees. Despite its popularity, the exploration of PMD's full potential is limited, with the majority of research focusing on a particular mirror map -- namely, the negative entropy -- which gives rise to the renowned Natural Policy Gradient (NPG) method. It remains uncertain from existing theoretical studies whether the choice of mirror map significantly influences PMD's efficacy. In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields less-than-optimal outcomes across several standard benchmark environments. By applying a meta-learning approach, we identify more efficient mirror maps that enhance performance, both on average and in terms of best performance achieved along the training trajectory. We analyze the characteristics of these learned mirror maps and reveal shared traits among certain settings. Our results suggest that mirror maps have the potential to be adaptable across various environments, raising questions about how to best match a mirror map to an environment's structure and characteristics.
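To make the role of the mirror map concrete, the following is a minimal sketch of a single PMD step under two different mirror maps. It assumes a tabular setting restricted to one state, exact action values, and a strictly positive current policy. The function names are illustrative and do not come from the paper. With the negative entropy, the step reduces to the familiar multiplicative (softmax NPG) update. With the squared Euclidean norm, it becomes a Euclidean projection onto the probability simplex.

```python
import math


def pmd_step_negative_entropy(policy, q_values, step_size):
    """One tabular PMD step for a single state with the negative-entropy mirror map.

    Closed form: pi_new(a) is proportional to pi(a) * exp(step_size * Q(a)),
    which is the tabular softmax NPG update.
    Assumes every entry of `policy` is strictly positive.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + step_size * q for p, q in zip(policy, q_values)]
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def pmd_step_squared_l2(policy, q_values, step_size):
    """The same step with the squared Euclidean norm as mirror map.

    The update is the Euclidean projection of pi + step_size * Q onto the simplex.
    """
    target = [p + step_size * q for p, q in zip(policy, q_values)]
    # Sort-based projection onto the probability simplex.
    ordered = sorted(target, reverse=True)
    cumulative, threshold = 0.0, 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        if value - candidate > 0:
            threshold = candidate
    return [max(x - threshold, 0.0) for x in target]


if __name__ == "__main__":
    uniform = [0.5, 0.5]
    q = [1.0, 0.0]  # hypothetical action values for illustration
    print(pmd_step_negative_entropy(uniform, q, step_size=1.0))
    print(pmd_step_squared_l2(uniform, q, step_size=1.0))
```

In this toy case the negative-entropy step keeps some probability on every action, while the Euclidean step can move all of it onto the greedy action in a single update. This is one simple way in which the choice of mirror map changes the behaviour of the resulting algorithm.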

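Our study finds that the conventional mirror map choice (NPG) often yields suboptimal results across several standard benchmark environments. By applying a meta-learning approach, we identify more effective mirror maps that improve performance, analyze the characteristics of these learned mirror maps, and reveal traits shared across certain settings. Our results suggest that mirror maps have the potential to adapt across environments, which raises the question of how best to match a mirror map to an environment's structure and characteristics.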

Meta-Learning and Mirror Maps in Policy Mirror Descent