Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e., disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events by learning to extract relevant information from a trajectory. We then propose to use these as future-conditional baselines and critics in policy gradient algorithms, and we develop a valid, practical variant with provably lower variance. Unbiasedness is achieved by constraining the hindsight information so that it contains no information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative problems.
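
The future-conditional baseline idea described above can be illustrated on a toy problem. The following is a minimal sketch, not the algorithm from the paper: it assumes a hypothetical one-step, two-action bandit in which the reward is `skill * action + luck`, the `luck` term is external noise drawn independently of the action, and the hindsight statistic is simply that noise, given directly rather than learned from a trajectory. All names (`gradient_samples`, `skill`, `luck`, `use_hindsight`) are illustrative assumptions. Under these assumptions, subtracting a baseline that conditions on `luck` should leave the score-function gradient estimate unbiased while removing the noise from its variance.

```python
import math
import random
import statistics


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gradient_samples(theta, skill, noise_std, n, use_hindsight, seed=0):
    """Per-sample score-function gradient estimates for a toy bandit.

    Assumed setup (illustrative only): the policy picks action 1 with
    probability sigmoid(theta); reward = skill * action + luck, where
    luck is Gaussian noise that does not depend on the action.
    """
    rng = random.Random(seed)
    p = sigmoid(theta)
    samples = []
    for _ in range(n):
        action = 1 if rng.random() < p else 0
        luck = rng.gauss(0.0, noise_std)  # external factor ("luck")
        reward = skill * action + luck
        score = action - p  # d/dtheta of log pi(action) for a Bernoulli-logit policy
        # Plain baseline: E[reward]. Hindsight baseline: E[reward | luck].
        # Both are computed analytically here instead of being learned.
        baseline = skill * p + (luck if use_hindsight else 0.0)
        samples.append(score * (reward - baseline))
    return samples


theta, skill, noise_std, n = 0.3, 1.0, 5.0, 20000
p = sigmoid(theta)
true_grad = skill * p * (1.0 - p)  # d/dtheta of E[reward]

plain = gradient_samples(theta, skill, noise_std, n, use_hindsight=False)
hindsight = gradient_samples(theta, skill, noise_std, n, use_hindsight=True)

print("true gradient      :", true_grad)
print("plain     mean, var:", statistics.fmean(plain), statistics.variance(plain))
print("hindsight mean, var:", statistics.fmean(hindsight), statistics.variance(hindsight))
```

Both estimators target the same gradient because the conditioning variable carries no information about the action; if the baseline were allowed to depend on something correlated with the action, the estimate would in general become biased. In this sketch the variance gap grows with `noise_std`, since the plain baseline cannot explain away the luck term.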

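This work applies counterfactual reasoning to reinforcement learning to address two problems: measuring the influence of an action on future rewards, and separating skill from luck. It proposes a policy gradient algorithm that uses future-conditional value functions as baselines, and validates it through experiments that incorporate sources of uncertainty, showing that the algorithm is effective and has low variance.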

Counterfactual Credit Assignment in Model-Free Reinforcement Learning