MDPs with low-rank transitions, in which the transition matrix factors into
the product of two matrices, left and right, are a highly representative
structure that enables tractable learning. The left matrix
enables expressive function approximation for value-based learning and has been
studied extensively. In this work, we instead investigate sample-efficient
learning with density features, i.e., the right matrix, which induce powerful
models for state-occupancy distributions. This setting not only sheds light on
leveraging unsupervised learning in RL, but also enables plug-in solutions for
convex RL. In the offline setting, we propose an algorithm for off-policy
estimation of occupancies that can handle non-exploratory data. Using this as a
subroutine, we further devise an online algorithm that constructs exploratory
data distributions in a level-by-level manner. As a central technical
challenge, the additive error of occupancy estimation is incompatible with the
multiplicative definition of data coverage. In the absence of strong
assumptions like reachability, this incompatibility easily leads to exponential
error blow-up, which we overcome via novel technical tools. Our results also
readily extend to the representation learning setting, where the density
features are unknown and must be learned from an exponentially large candidate
set.
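
The factorization described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration (the toy sizes, the numbers, and the names phi, mu, transition, next_occupancy, occupancy_weights); it is not the paper's algorithm. It only shows why the right matrix acts as a set of density features: after one step, the state occupancy is a weighted combination of the rows of mu.

```python
# Hypothetical toy problem: 3 states, 2 actions, rank 2.
# phi[(s, a)] is the left feature (here a distribution over 2 latent factors).
# mu[k] is the right "density feature" (a distribution over next states).
phi = {
    (0, 0): [1.0, 0.0], (0, 1): [0.5, 0.5],
    (1, 0): [0.0, 1.0], (1, 1): [0.25, 0.75],
    (2, 0): [0.5, 0.5], (2, 1): [1.0, 0.0],
}
mu = [[0.5, 0.25, 0.25], [0.0, 0.5, 0.5]]
n_states, n_actions, rank = 3, 2, 2


def transition(s, a):
    """P(. | s, a) = sum_k phi(s, a)[k] * mu[k][.], the low-rank factorization."""
    return [
        sum(phi[(s, a)][k] * mu[k][s2] for k in range(rank))
        for s2 in range(n_states)
    ]


def next_occupancy(d, policy):
    """Push a state occupancy d through one step under policy[s][a]."""
    out = [0.0] * n_states
    for s in range(n_states):
        for a in range(n_actions):
            p = transition(s, a)
            for s2 in range(n_states):
                out[s2] += d[s] * policy[s][a] * p[s2]
    return out


def occupancy_weights(d, policy):
    """theta[k] = expected value of phi(s, a)[k] under s ~ d, a ~ policy."""
    return [
        sum(
            d[s] * policy[s][a] * phi[(s, a)][k]
            for s in range(n_states)
            for a in range(n_actions)
        )
        for k in range(rank)
    ]


d0 = [1.0, 0.0, 0.0]               # start in state 0
uniform = [[0.5, 0.5]] * n_states  # uniform policy in every state
d1 = next_occupancy(d0, uniform)
theta = occupancy_weights(d0, uniform)
# The same occupancy, written as a linear combination of the density features.
d1_from_features = [
    sum(theta[k] * mu[k][s2] for k in range(rank)) for s2 in range(n_states)
]
```

In this sketch d1 and d1_from_features agree, so an occupancy at the next level is described by only rank-many weights once mu is known.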

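This paper studies MDPs with low-rank transition matrices, in particular sample-efficient learning with density features. It proposes an algorithm for the offline setting that handles non-exploratory data, and an online algorithm that constructs exploratory data distributions level by level.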

Reinforcement Learning in Low-Rank MDPs with Density Features

In view of its power in extracting feature representations, contrastive
self-supervised learning has been successfully integrated into the practice of
(deep) reinforcement learning (RL), leading to efficient policy learning in
various applications. Despite its tremendous empirical successes, the
understanding of contrastive learning for RL remains elusive. To narrow such a
gap, we study how RL can be empowered by contrastive learning in a class of
Markov decision processes (MDPs) and Markov games (MGs) with low-rank
transitions. For both models, we propose to extract the correct feature
representations of the low-rank model by minimizing a contrastive loss.
Moreover, under the online setting, we propose novel upper confidence bound
(UCB)-type algorithms that incorporate such a contrastive loss with online RL
algorithms for MDPs or MGs. We further theoretically prove that our algorithm
recovers the true representations and simultaneously achieves sample efficiency
in learning the optimal policy and Nash equilibrium in MDPs and MGs. We also
provide empirical studies to demonstrate the efficacy of the UCB-based
contrastive learning method for RL. To the best of our knowledge, we provide
the first provably efficient online RL algorithm that incorporates contrastive
learning for representation learning. Our code is available at
this https URL
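
To make the idea of minimizing a contrastive loss concrete, here is a minimal sketch of a generic binary contrastive (noise-contrastive style) log loss over inner-product scores. It is an assumption for illustration only: the paper's exact loss and negative-sampling scheme may differ, and the names score, contrastive_loss, phi, psi_good, and psi_bad, along with all numbers, are hypothetical.

```python
import math


def score(phi_sa, psi_s2):
    """Inner-product score f(s, a, s') = <phi(s, a), psi(s')>, assumed positive."""
    return sum(p * q for p, q in zip(phi_sa, psi_s2))


def contrastive_loss(samples, phi, psi):
    """Average binary log loss over (s, a, s', y) tuples.

    y = 1 marks an observed transition; y = 0 marks a negative whose s' was
    drawn from a noise distribution. The model's probability that y = 1 is
    f / (1 + f).
    """
    total = 0.0
    for s, a, s2, y in samples:
        f = score(phi[(s, a)], psi[s2])
        p_pos = f / (1.0 + f)
        total -= math.log(p_pos) if y == 1 else math.log(1.0 - p_pos)
    return total / len(samples)


# Hypothetical toy data: state 0 tends to stay at 0, state 1 tends to stay at 1.
phi = {(0, 0): [1.0, 0.0], (1, 0): [0.0, 1.0]}
samples = [(0, 0, 0, 1), (0, 0, 1, 0), (1, 0, 1, 1), (1, 0, 0, 0)]

# Features that score real transitions high, and features that do the opposite.
psi_good = {0: [4.0, 0.25], 1: [0.25, 4.0]}
psi_bad = {0: [0.25, 4.0], 1: [4.0, 0.25]}

loss_good = contrastive_loss(samples, phi, psi_good)
loss_bad = contrastive_loss(samples, phi, psi_bad)
```

Features that separate observed transitions from negatives attain the lower loss here (loss_good is below loss_bad), which is the sense in which minimizing such a loss selects a representation.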

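By minimizing a contrastive loss to extract the correct feature representations, this work brings contrastive self-supervised learning to Markov decision processes and Markov games. It further proposes UCB-type algorithms that combine this loss with online RL algorithms, and proves that the algorithm recovers the true representations while achieving sample efficiency in learning the optimal policy and Nash equilibrium.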