Learning a good representation is a crucial challenge for Reinforcement
Learning (RL) agents. Self-predictive learning provides a means to jointly learn
a latent representation and dynamics model by bootstrapping from future latent
representations (BYOL). Recent work has developed theoretical insights into
these algorithms by studying a continuous-time ODE model for self-predictive
representation learning under the simplifying assumption that the algorithm
depends on a fixed policy (BYOL-$\Pi$); this assumption is at odds with
practical instantiations of such algorithms, which explicitly condition their
predictions on future actions. In this work, we take a step towards bridging
the gap between theory and practice by analyzing an action-conditional
self-predictive objective (BYOL-AC) using the ODE framework, characterizing its
convergence properties and highlighting important distinctions between the
limiting solutions of the BYOL-$\Pi$ and BYOL-AC dynamics. We show how the two
representations are related by a variance equation. This connection leads to a
novel variance-like action-conditional objective (BYOL-VAR) and its
corresponding ODE. We unify the study of all three objectives through two
complementary lenses: a model-based perspective, where each objective is shown
to be equivalent to a low-rank approximation of certain dynamics, and a
model-free perspective, which establishes relationships between the objectives
and the value, Q-value, and advantage functions, respectively. Our empirical
investigations, encompassing both linear function approximation and Deep RL
environments, demonstrate that BYOL-AC performs better overall across a variety
of settings.
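
As a rough illustration of the distinction between the fixed-policy and
action-conditional objectives, the following sketch assumes one-hot state
inputs, a linear encoder `Phi`, and linear latent predictors (a single matrix
`P` for BYOL-$\Pi$, one matrix per action for BYOL-AC). These names, shapes, and
the squared-error form are illustrative assumptions, not necessarily the exact
formulation analyzed in this work.

```python
import numpy as np

def byol_pi_loss(Phi, P, x, x_next):
    """Self-predictive loss with one predictor shared across all actions.

    Phi: (n, k) linear encoder, P: (k, k) latent predictor,
    x, x_next: (B, n) current and next state features.
    """
    z = x @ Phi                # current latents, shape (B, k)
    target = x_next @ Phi      # bootstrapped target; held constant (stop-gradient) in training
    pred = z @ P               # predicted next latents
    return np.mean(np.sum((pred - target) ** 2, axis=1))

def byol_ac_loss(Phi, P_per_action, x, a, x_next):
    """Action-conditional variant: a separate predictor for each action.

    P_per_action: (A, k, k) stack of predictors, a: (B,) integer actions.
    """
    z = x @ Phi
    target = x_next @ Phi      # held constant (stop-gradient) in training
    # Pick the predictor matching each transition's action, then apply it.
    pred = np.einsum("bk,bkj->bj", z, P_per_action[a])
    return np.mean(np.sum((pred - target) ** 2, axis=1))
```

When every per-action predictor is set to the same matrix, the two losses in
this sketch coincide; they differ only once the predictors are allowed to
specialize by action.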

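Representation learning is a key challenge for reinforcement learning agents.
This work analyzes an action-conditional self-predictive objective (BYOL-AC)
using the ODE framework, characterizes its convergence properties, highlights
the important distinctions between the BYOL-$\Pi$ and BYOL-AC dynamics, and
shows how the two representations differ and how they are related. Empirical
results with linear function approximation and in deep RL environments indicate
that BYOL-AC performs better across a variety of settings.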