Modern meta-reinforcement learning (Meta-RL) methods are mainly developed
based on model-agnostic meta-learning, which performs policy gradient steps
across tasks to maximize policy performance. However, gradient conflicts
between tasks remain poorly understood in Meta-RL and may degrade performance
when the tasks are distinct. To tackle this challenge, this
paper proposes a novel personalized Meta-RL (pMeta-RL) algorithm, which
aggregates task-specific personalized policies to update a meta-policy used for
all tasks, while maintaining personalized policies to maximize the average
return of each task under the constraint of the meta-policy. We also provide
a theoretical analysis in the tabular setting, which establishes the
convergence of our pMeta-RL algorithm. Moreover, we extend the proposed
pMeta-RL algorithm to a deep network version based on soft actor-critic, making
it suitable for continuous control tasks. Experimental results show that the
proposed algorithms outperform previous Meta-RL algorithms on the Gym and
MuJoCo suites.
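
The interplay between personalized policies and the meta-policy can be
illustrated with a toy sketch. This is not the paper's algorithm: the abstract
does not state the form of the constraint or of the aggregation, so the
quadratic penalty toward the meta-policy, the plain averaging step, the
two-armed bandit tasks, and all names and hyperparameters below (`lam`, `lr`,
`train`, and so on) are assumptions chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_return(logits, rewards):
    """Expected reward of a softmax policy on a one-step (bandit) task."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def personalized_step(theta, meta, rewards, lr=0.5, lam=0.1):
    """One ascent step on J_i(theta) - (lam / 2) * ||theta - meta||^2.

    The quadratic penalty is an ASSUMED stand-in for "the constraint of the
    meta-policy"; the abstract does not specify the actual constraint.
    """
    probs = softmax(theta)
    avg = sum(p * r for p, r in zip(probs, rewards))
    # Exact policy gradient of a softmax bandit policy: p_a * (r_a - avg).
    grad = [p * (r - avg) for p, r in zip(probs, rewards)]
    return [t + lr * (g - lam * (t - m)) for t, g, m in zip(theta, grad, meta)]


def train(task_rewards, rounds=200, local_steps=5):
    """Alternate personalized updates with a meta-policy aggregation step."""
    n_actions = len(task_rewards[0])
    meta = [0.0] * n_actions
    personal = [list(meta) for _ in task_rewards]
    for _ in range(rounds):
        for i, rewards in enumerate(task_rewards):
            for _ in range(local_steps):
                personal[i] = personalized_step(personal[i], meta, rewards)
        # ASSUMED aggregation rule: average the personalized parameters.
        meta = [sum(p[a] for p in personal) / len(personal)
                for a in range(n_actions)]
    return meta, personal


if __name__ == "__main__":
    # Two tasks with directly conflicting rewards.
    tasks = [[1.0, 0.0], [0.0, 1.0]]
    meta, personal = train(tasks)
    for i, rewards in enumerate(tasks):
        print(i, expected_return(meta, rewards),
              expected_return(personal[i], rewards))
```

In this toy setting the two tasks pull the shared parameters in opposite
directions, so the averaged meta-policy stays close to uniform, while each
personalized policy is free to favor its own task's rewarded action.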

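This paper proposes a personalized meta-reinforcement learning algorithm
(pMeta-RL) to address the gradient conflict problem in meta-reinforcement
learning. The algorithm aggregates task-specific personalized policies to
update a meta-policy shared by all tasks, while maintaining the personalized
policies to maximize the average return of each task. Experiments on discrete
and continuous control tasks show that it outperforms previous Meta-RL
algorithms.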