We investigate an infinite-horizon average-reward Markov decision process
(MDP) with delayed, composite, and partially anonymous reward feedback. The
delay and compositeness of rewards mean that rewards generated as a result of
taking an action at a given state are fragmented into diff