We investigate an infinite-horizon average-reward Markov Decision Process
(MDP) with delayed, composite, and partially anonymous reward feedback. The
delay and compositeness of rewards mean that rewards generated as a result of
taking an action at a given state are fragmented into different components, and
they are sequentially realized at delayed time instances. The partial anonymity
attribute implies that a learner, for each state, only observes the aggregate
of past reward components generated as a result of different actions taken at
that state, but realized at the observation instance. We propose an algorithm
named $\mathrm{DUCRL2}$ to obtain a near-optimal policy for this setting and
show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} +
d (SA)^3\right)$ where $S$ and $A$ are the sizes of the state and action
spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper
bounded by the maximum reward delay, and $T$ denotes the time horizon. The
bound is therefore order-optimal in $T$, and the delay contributes only an
additive term.
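
To make the feedback model concrete, the following is a minimal sketch of how delayed, composite, and partially anonymous rewards might reach the learner. It is not the $\mathrm{DUCRL2}$ algorithm. The names `split_reward` and `simulate_feedback`, the even split of each reward across delays, and the toy trajectory are all assumptions introduced here for illustration; the setting itself does not prescribe how a reward is fragmented.

```python
from collections import defaultdict


def split_reward(reward, delay_max):
    """Hypothetical fragmentation rule (an assumption, not from the paper):
    spread the reward evenly over delays 0, 1, ..., delay_max."""
    share = reward / (delay_max + 1)
    return [share] * (delay_max + 1)


def simulate_feedback(trajectory, delay_max):
    """Return, for each time step, the per-state aggregate the learner observes.

    trajectory: list of (state, action, reward) tuples, one per time step.
    The action is deliberately dropped when aggregating: the learner sees
    only the sum of components realized now for each state, not which
    action (or which past time step) produced them.
    """
    horizon = len(trajectory)
    observed = [defaultdict(float) for _ in range(horizon)]
    for t, (state, _action, reward) in enumerate(trajectory):
        for delay, component in enumerate(split_reward(reward, delay_max)):
            if t + delay < horizon:  # components past the horizon are never seen
                observed[t + delay][state] += component
    return [dict(step) for step in observed]


if __name__ == "__main__":
    toy_trajectory = [("s0", "a0", 1.0), ("s0", "a1", 0.5), ("s1", "a0", 2.0)]
    for t, feedback in enumerate(simulate_feedback(toy_trajectory, delay_max=1)):
        print(t, feedback)
```

In this toy run, the value observed for `s0` at step 1 mixes a delayed component of action `a0` with an immediate component of action `a1`, which is the ambiguity the learner has to resolve.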

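This paper studies an infinite-horizon average-reward Markov Decision Process with delayed, composite, and partially anonymous reward feedback, proposes an algorithm named $\mathrm{DUCRL2}$ to obtain a near-optimal policy, and proves that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d (SA)^3\right)$.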