Which functionals of the reward can be computed and optimized exactly in Markov Decision Processes? In the finite-horizon, undiscounted setting, Dynamic Programming (DP) can only handle these operations efficiently for certain classes of statistics. We summarize the characterization of these classes for policy evaluation, and give a new answer for the planning problem. Interestingly, we prove that only generalized means can be optimized exactly, even in the more general framework of Distributional Reinforcement Learning (DistRL). DistRL does, however, make it possible to evaluate other functionals approximately. We provide error bounds on the resulting estimators, and discuss the potential of this approach as well as its limitations. These results contribute to the theory of Markov Decision Processes by examining overall characteristics of the return, and in particular risk-conscious strategies.

Beyond Average Return in Markov Decision Processes