Maximising a cumulative reward function that is Markov and stationary, i.e.,
defined over state-action pairs and independent of time, is sufficient to
capture many kinds of goals in a Markov decision process (MDP). However, not
all goals can be captured in this manner. In this paper we