Learning from off-policy data is essential for sample-efficient reinforcement learning. In the present work, we build on the insight that the advantage function can be understood as the causal effect of an action on the return, and show that this allows us to decompose the return of a trajectory into parts caused by the agent's actions (skill) and parts outside of the agent's control (luck). Furthermore, this decomposition enables us to naturally extend Direct Advantage Estimation (DAE) to off-policy settings (Off-policy DAE). The resulting method can learn from off-policy trajectories without relying on importance sampling techniques or truncating off-policy actions. We draw connections between Off-policy DAE and previous methods to demonstrate how it can speed up learning and when the proposed off-policy corrections are important. Finally, we use the MinAtar environments to illustrate how ignoring off-policy corrections can lead to suboptimal policy optimization performance.
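To make the skill/luck split concrete, here is a minimal sketch on a hypothetical two-state episodic MDP. It is not the authors' implementation, and the paper's exact definitions and notation may differ. It assumes skill is the discounted sum of advantages, A(s_t, a_t) = Q(s_t, a_t) - V(s_t), and luck is the discounted sum of the residuals r_t + γV(s_{t+1}) - Q(s_t, a_t), each of which has zero mean given (s_t, a_t). Under these assumptions the terms telescope, so the return of any trajectory equals V(s_0) + skill + luck, even when a different behaviour policy chose the actions. All names (`P`, `PI`, `evaluate`, `rollout`, `decompose`) and the transition table are illustrative.

```python
import random

GAMMA = 0.9
TERMINAL = 2

# Hypothetical toy MDP: P[s][a] = list of (probability, next_state, reward).
P = {
    0: {0: [(0.5, 1, 0.0), (0.5, TERMINAL, 1.0)],
        1: [(1.0, 1, 0.2)]},
    1: {0: [(0.8, TERMINAL, 2.0), (0.2, TERMINAL, 0.0)],
        1: [(1.0, TERMINAL, 1.0)]},
}
# Target policy pi(a|s) whose value functions define skill and luck.
PI = {0: [0.5, 0.5], 1: [0.3, 0.7]}


def evaluate():
    """Exact Q and V of PI; the MDP is acyclic, so one backward pass suffices."""
    V = {TERMINAL: 0.0}
    Q = {}
    for s in (1, 0):  # reverse topological order
        Q[s] = {a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in outs)
                for a, outs in P[s].items()}
        V[s] = sum(PI[s][a] * Q[s][a] for a in Q[s])
    return V, Q


def rollout(rng):
    """Sample a trajectory with a uniform behaviour policy (off-policy w.r.t. PI)."""
    s, traj = 0, []
    while s != TERMINAL:
        a = rng.choice([0, 1])
        outs = P[s][a]
        _, s2, r = rng.choices(outs, weights=[o[0] for o in outs])[0]
        traj.append((s, a, r, s2))
        s = s2
    return traj


def decompose(traj, V, Q):
    """Return (discounted return, skill, luck) for one trajectory."""
    ret = skill = luck = 0.0
    for t, (s, a, r, s2) in enumerate(traj):
        d = GAMMA ** t
        ret += d * r
        skill += d * (Q[s][a] - V[s])                # advantage: effect of the action
        luck += d * (r + GAMMA * V[s2] - Q[s][a])    # residual: outside the agent's control
    return ret, skill, luck


if __name__ == "__main__":
    V, Q = evaluate()
    ret, skill, luck = decompose(rollout(random.Random(0)), V, Q)
    print(ret, V[0] + skill + luck)  # the two numbers agree up to rounding
```

The identity holds per trajectory with no importance weights. The sketch only illustrates the decomposition with known value functions; it does not show how Off-policy DAE estimates them from data.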

Skill or Luck? Return Decomposition via Advantage Functions