Autonomous agents must often deal with conflicting requirements, such as
completing tasks using the least amount of time/energy, learning multiple
tasks, or dealing with multiple opponents. In the context of reinforcement
learning~(RL), these problems are addressed by (i)~designing a reward function
that simultaneously describes all requirements or (ii)~combining modular value
functions that encode them individually. Though effective, these methods have
critical downsides. Designing good reward functions that balance different
objectives is challenging, especially as the number of objectives grows.
Moreover, implicit interference between goals may lead to performance plateaus
as they compete for resources, particularly when training on-policy. Similarly,
selecting parameters to combine value functions is at least as hard as
designing an all-encompassing reward, given that the effect of their values on
the overall policy is not straightforward. The latter is generally addressed by
formulating the conflicting requirements as a constrained RL problem, which is
then solved using primal-dual methods. These algorithms are in general not
guaranteed to converge to the optimal solution since the problem is not convex. This work
provides theoretical support to these approaches by establishing that despite
its non-convexity, this problem has zero duality gap, i.e., it can be solved
exactly in the dual domain, where it becomes convex. Finally, we show that this
result essentially holds when the policy is described by a sufficiently
expressive parametrization~(e.g., neural networks), connect it with the
primal-dual algorithms present in the literature, and establish their
convergence to the optimal solution.
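
As a sketch of the formulation referred to above, the constrained RL problem
and its dual can be written as follows. The notation is an assumption made for
illustration and is not fixed by this abstract: $V_0(\pi)$ is the value of the
main objective under policy $\pi$, $V_i(\pi)$ and $c_i$ are the value and the
required threshold of the $i$-th requirement, and $\lambda$ collects the dual
variables.
\begin{align*}
  P^\star &= \max_{\pi} \; V_0(\pi)
    \quad \text{subject to} \quad V_i(\pi) \geq c_i, \quad i = 1, \dots, m, \\
  \mathcal{L}(\pi, \lambda) &= V_0(\pi)
    + \sum_{i=1}^{m} \lambda_i \bigl( V_i(\pi) - c_i \bigr), \\
  D^\star &= \min_{\lambda \geq 0} \; \max_{\pi} \; \mathcal{L}(\pi, \lambda).
\end{align*}
In this notation, zero duality gap means $P^\star = D^\star$. The dual function
$d(\lambda) = \max_{\pi} \mathcal{L}(\pi, \lambda)$ is a pointwise maximum of
functions that are affine in $\lambda$, so it is convex even though the primal
problem is not. A typical primal-dual iteration, again only as an illustration
with an assumed step size $\eta > 0$, alternates a policy update that increases
$\mathcal{L}(\cdot, \lambda_k)$ with the projected dual step
\[
  \lambda_{i,k+1} = \bigl[ \lambda_{i,k}
    - \eta \bigl( V_i(\pi_k) - c_i \bigr) \bigr]_{+},
\]
which raises $\lambda_i$ when the $i$-th requirement is violated and lowers it
otherwise.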

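This paper studies the difficulties faced by constrained autonomous agents, focusing on how primal-dual methods can be applied so that they converge. By examining the limitations of approaches such as multi-objective reward functions, multi-objective learning, and the combination of multi-objective value functions, it proposes a primal-dual algorithm. Unlike other algorithms, this method obtains the actual optimal solution once the conflicting objectives are formulated as a constrained RL problem, is convergent, and can be extended to some neural network models.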