In this paper, we investigate the concentration properties of cumulative rewards in Markov Decision Processes (MDPs), focusing on both asymptotic and non-asymptotic settings. We introduce a unified approach to characterizing reward concentration in MDPs, covering both infinite-horizon settings (i.e., the average-reward and discounted-reward frameworks) and the finite-horizon setting. Our asymptotic results include the law of large numbers, the central limit theorem, and the law of the iterated logarithm, while our non-asymptotic bounds include Azuma-Hoeffding-type inequalities and a non-asymptotic version of the law of the iterated logarithm. Additionally, we explore two key implications of our results. First, we analyze the sample-path behavior of the difference in rewards between any two stationary policies. Second, we show that two alternative definitions of regret for learning policies proposed in the literature are rate-equivalent. Our proof techniques rely on a novel martingale decomposition of cumulative rewards, properties of the solution to the policy evaluation fixed-point equation, and both asymptotic and non-asymptotic concentration results for martingale difference sequences.

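This paper studies the concentration properties of cumulative rewards in Markov Decision Processes (MDPs), aiming to fill a gap in the existing literature on this topic. We propose a unified approach to characterizing reward concentration in MDPs that covers both infinite-horizon and finite-horizon settings. We characterize the sample-path behavior of the reward difference between stationary policies and its implications for the definitions of regret used for learning policies, thereby offering a new perspective on the analysis of MDPs.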

Concentration of Cumulative Rewards in Markov Decision Processes