Learning a good representation is a crucial challenge for Reinforcement
Learning (RL) agents. Self-predictive learning provides a means to jointly learn
a latent representation and dynamics model by bootstrapping from future latent
representations (BYOL). Recent work has developed theoretical insights into
these algorithms by studying a continuous-time ODE model for self-predictive
representation learning under the simplifying assumption that the algorithm
depends on a fixed policy (BYOL-$\Pi$); this assumption is at odds with
practical instantiations of such algorithms, which explicitly condition their
predictions on future actions. In this work, we take a step towards bridging
the gap between theory and practice by analyzing an action-conditional
self-predictive objective (BYOL-AC) using the ODE framework, characterizing its
convergence properties and highlighting important distinctions between the
limiting solutions of the BYOL-$\Pi$ and BYOL-AC dynamics. We show how the two
representations are related by a variance equation. This connection leads to a
novel variance-like action-conditional objective (BYOL-VAR) and its
corresponding ODE. We unify the study of all three objectives through two
complementary lenses: a model-based perspective, where each objective is shown
to be equivalent to a low-rank approximation of certain dynamics, and a
model-free perspective, which establishes relationships between the objectives
and their respective value, Q-value, and advantage functions. Our empirical
investigations, encompassing both linear function approximation and deep RL
environments, demonstrate that BYOL-AC performs better overall across a variety
of settings.
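
As a rough illustration of how a fixed-policy objective and an action-conditional one can differ by a variance-like term, the following NumPy sketch compares the two in a tabular, linear setting. Everything in it is an assumption made for illustration rather than the paper's construction: states are one-hot, the policy is uniform over actions, `Phi` is a linear representation, predictors are fit by least squares, targets are expected next-state representations, and the names `random_transition_matrices`, `optimal_residual`, `byol_losses`, and `demo` are hypothetical. The identity it checks is the ordinary mean-of-squares decomposition applied to prediction residuals; it is meant as an analogy to the variance equation mentioned above, not a restatement of it.

```python
import numpy as np


def random_transition_matrices(num_actions, n, rng):
    """Hypothetical per-action transition matrices T_a (each row sums to 1)."""
    T = rng.random((num_actions, n, n))
    return T / T.sum(axis=2, keepdims=True)


def optimal_residual(Phi, target):
    """Residual Phi @ P - target for the least-squares linear predictor P."""
    P = np.linalg.lstsq(Phi, target, rcond=None)[0]
    return Phi @ P - target


def byol_losses(Phi, T_actions):
    """Squared prediction errors at the least-squares predictors.

    Returns (policy-level loss, action-conditional loss, variance-like gap),
    assuming a uniform policy over actions.
    """
    # Fixed-policy case: a single predictor for the averaged dynamics.
    T_pi = T_actions.mean(axis=0)
    r_pi = optimal_residual(Phi, T_pi @ Phi)

    # Action-conditional case: one predictor P_a per action.
    r_ac = np.stack([optimal_residual(Phi, T_a @ Phi) for T_a in T_actions])

    loss_pi = np.sum(r_pi ** 2)
    loss_ac = np.mean(np.sum(r_ac ** 2, axis=(1, 2)))

    # Spread of the per-action residuals around the policy-level residual.
    gap = np.mean(np.sum((r_ac - r_pi) ** 2, axis=(1, 2)))
    return loss_pi, loss_ac, gap


def demo(num_actions=4, n=10, k=3, seed=0):
    rng = np.random.default_rng(seed)
    T_actions = random_transition_matrices(num_actions, n, rng)
    Phi = rng.standard_normal((n, k))
    return byol_losses(Phi, T_actions)
```

Calling `demo()` returns three numbers for which `loss_ac` equals `loss_pi + gap` up to floating-point error. This holds in the sketch because the least-squares residual is linear in its target, so the policy-level residual is the average of the per-action residuals.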

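Representation learning is a key challenge for reinforcement learning agents. This paper analyzes an action-conditional self-predictive objective (BYOL-AC) using the ODE framework, characterizes its convergence properties, highlights important distinctions between the BYOL-$\Pi$ and BYOL-AC dynamics, and shows how the two resulting representations differ and relate. Empirical studies with linear function approximation and in deep RL environments indicate that BYOL-AC performs better across a variety of settings.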

A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning

We study the learning dynamics of self-predictive learning for reinforcement
learning, a family of algorithms that learn representations by minimizing the
prediction error of their own future latent representations. Despite their recent
empirical success, such algorithms have an apparent defect: trivial
representations (such as constants) minimize the prediction error, yet it is
obviously undesirable to converge to such solutions. Our central insight is
that careful design of the optimization dynamics is critical to learning
meaningful representations. We identify that faster-paced optimization of the
predictor and semi-gradient updates on the representation are crucial to
preventing representation collapse. Then, in an idealized setup, we show that
self-predictive learning dynamics carry out a spectral decomposition of the
state transition matrix, effectively capturing information about the transition
dynamics. Building on these theoretical insights, we propose bidirectional
self-predictive learning, a novel self-predictive algorithm that learns two
representations simultaneously. We examine the robustness of our theoretical
insights with a number of small-scale experiments and showcase the promise of
the novel representation learning algorithm with large-scale experiments.
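
The role of the two design choices named above can be illustrated with a small NumPy sketch. It assumes a tabular setting that is not spelled out in the text: one-hot states weighted uniformly, a symmetric transition matrix `T`, a linear representation `Phi`, and a linear predictor `P` that is re-solved by least squares before every representation update (an idealized stand-in for faster-paced predictor optimization). The function names and hyperparameters are hypothetical. Because the least-squares residual is orthogonal to the columns of `Phi`, a semi-gradient step in this sketch cannot shrink `Phi.T @ Phi`, so the representation does not collapse to a constant.

```python
import numpy as np


def symmetric_transition_matrix(n, rng, num_perms=8):
    """Hypothetical helper: a random symmetric, doubly stochastic matrix.

    Averaging permutation matrices together with their transposes keeps
    every row and column summing to 1 and makes the result symmetric.
    """
    T = np.zeros((n, n))
    for _ in range(num_perms):
        perm_matrix = np.eye(n)[rng.permutation(n)]
        T += perm_matrix + perm_matrix.T
    return T / (2 * num_perms)


def optimal_predictor(Phi, T):
    """Least-squares predictor: argmin_P ||Phi @ P - T @ Phi||_F."""
    return np.linalg.lstsq(Phi, T @ Phi, rcond=None)[0]


def semi_gradient_step(Phi, T, lr):
    """One representation update with the target T @ Phi held constant."""
    P = optimal_predictor(Phi, T)       # predictor solved first ("fast" timescale)
    residual = Phi @ P - T @ Phi        # prediction error against the fixed target
    return Phi - lr * residual @ P.T    # gradient taken through the prediction only


def run_dynamics(n=20, k=3, steps=500, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    T = symmetric_transition_matrix(n, rng)
    # Start from an orthonormal representation, so Phi0.T @ Phi0 is the identity.
    Phi0, _ = np.linalg.qr(rng.standard_normal((n, k)))
    Phi = Phi0.copy()
    for _ in range(steps):
        Phi = semi_gradient_step(Phi, T, lr)
    return Phi0, Phi
```

After `Phi0, Phi = run_dynamics()`, inspecting `np.linalg.eigvalsh(Phi.T @ Phi)` shows that no eigenvalue has dropped below its initial value of 1. In this sketch the Gram matrix changes only at second order in the step size, which is the discrete-time counterpart of it being constant under the continuous-time dynamics.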

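This work studies the learning dynamics of self-predictive learning. Building on an analysis of how the optimization dynamics are designed, it proposes a bidirectional self-predictive learning algorithm and validates its effectiveness through a series of experiments.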