Real-world applications of reinforcement learning for recommendation and
experimentation faces a practical challenge: the relative reward of different
bandit arms can evolve over the lifetime of the learning agent. To deal with
these non-stationary cases, the agent must forget some his