We consider an agent who is involved in a Markov decision process and receives a vector of outcomes every round. Her objective is to maximize a global concave reward function on the average vectorial outcome. The problem models applications such as multi-objective optimization, maximum entropy exploration, and constrained optimization in Markovian environments. In our general setting where a stationary policy could have multiple recurrent classes, the agent faces a subtle yet consequential trade-off in alternating among different actions for balancing the vectorial outcomes. In particular, stationary policies are in general sub-optimal. We propose a no-regret algorithm based on online convex optimization (OCO) tools (Agrawal and Devanur 2014) and UCRL2 (Jaksch et al. 2010). Importantly, we introduce a novel gradient threshold procedure, which carefully controls the switches among actions to handle the subtle trade-off. By delaying the gradient updates, our procedure produces a non-stationary policy that diversifies the outcomes for optimizing the objective. The procedure is compatible with a variety of OCO tools.

研究了在Markov决策问题中，代理人通过在线凸规划算法设计非固定策略，以最大化全局凹奖励函数和矢量结果的均值，以解决多目标优化和Markov环境下的受限优化问题。

在线马尔可夫决策过程中全局凸奖励的强化学习中的勘探利用权衡