A significant challenge in multi-objective reinforcement learning is
obtaining a Pareto front of policies that attain optimal performance under
different preferences. We introduce Iterated Pareto Referent Optimisation
(IPRO), a principled algorithm that decomposes the task of finding the Pareto
front into a sequence of single-objective problems for which various solution
methods exist. This enables us to establish convergence guarantees while
providing an upper bound on the distance to undiscovered Pareto-optimal
solutions at each step. Empirical evaluations demonstrate that IPRO matches or
outperforms methods that require additional domain knowledge. By leveraging
problem-specific single-objective solvers, our approach also holds promise for
applications beyond multi-objective reinforcement learning, such as
pathfinding and optimisation.
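
To make the decomposition idea concrete, the following is a minimal sketch and not the IPRO algorithm itself. It assumes a small finite set of candidate value vectors (all objectives maximised) and uses brute-force search as a stand-in for a single-objective solver. The names `oracle`, `pareto_front`, and `weakly_dominates` are hypothetical and chosen only for illustration. Each oracle call solves one single-objective problem relative to a referent point, and each solution found splits the remaining search region into new referents.

```python
from typing import List, Optional, Tuple

Vector = Tuple[int, ...]


def weakly_dominates(a: Vector, b: Vector) -> bool:
    """True if a is at least as good as b in every objective."""
    return all(x >= y for x, y in zip(a, b))


def oracle(candidates: List[Vector], referent: Vector,
           front: List[Vector]) -> Optional[Vector]:
    """Single-objective stand-in (assumption): brute-force search.

    Among candidates at or above the referent that are not covered by the
    current front, return the one maximising the smallest gain over the
    referent, with ties broken by the total gain. Returns None if the
    region above the referent holds nothing new.
    """
    feasible = [
        c for c in candidates
        if weakly_dominates(c, referent)
        and not any(weakly_dominates(f, c) for f in front)
    ]
    if not feasible:
        return None

    def score(c: Vector) -> Tuple[int, int]:
        gains = [x - r for x, r in zip(c, referent)]
        return (min(gains), sum(gains))

    return max(feasible, key=score)


def pareto_front(candidates: List[Vector]) -> List[Vector]:
    """Find all Pareto-optimal vectors via a sequence of oracle calls."""
    dims = len(candidates[0])
    # Start from a lower bound: the worst value seen in each objective.
    start = tuple(min(c[i] for c in candidates) for i in range(dims))
    referents = [start]
    front: List[Vector] = []
    while referents:
        r = referents.pop()
        v = oracle(candidates, r, front)
        if v is None:
            continue  # nothing undiscovered above this referent
        front.append(v)
        # Any undiscovered optimum above r must beat v in some objective,
        # so raise the referent to v in one objective at a time.
        for i in range(dims):
            referents.append(r[:i] + (v[i],) + r[i + 1:])
    return front


candidates = [(1, 5), (2, 4), (3, 3), (4, 1), (2, 2), (1, 1), (3, 1)]
print(sorted(pareto_front(candidates)))
```

In this toy setting the loop terminates because every successful oracle call adds a new vector to the front and every failed call discards one referent. The distance bound and the convergence guarantees mentioned above are not modelled here.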
