Maximising a cumulative reward function that is Markov and stationary, i.e.,
defined over state-action pairs and independent of time, is sufficient to
capture many kinds of goals in a Markov decision process (MDP). However, not
all goals can be captured in this manner. In this paper we study convex MDPs in
which goals are expressed as convex functions of the stationary distribution
and show that they cannot be formulated using stationary reward functions.
Convex MDPs generalize the standard reinforcement learning (RL) problem
formulation to a larger framework that includes many supervised and
unsupervised RL problems, such as apprenticeship learning, constrained MDPs,
and so-called `pure exploration'. Our approach is to reformulate the convex MDP
problem as a min-max game involving policy and cost (negative reward)
`players', using Fenchel duality. We propose a meta-algorithm for solving this
problem and show that it unifies many existing algorithms in the literature.

本文研究在马尔可夫决策过程中用凸函数表达目标的问题，使用 Fenchel 对偶将其重新表达为一个涉及策略和成本（负奖励）的 min-max 博弈，并提出一个元算法以统一现有文献中的各种算法。