Despite recent progress in reinforcement learning (RL) from raw pixel data,
sample inefficiency continues to present a substantial obstacle. Prior works
have attempted to address this challenge by creating self-supervised auxiliary
tasks, aiming to enrich the agent's learned representations with
control-relevant information for future state prediction. However, these
objectives are often insufficient for learning representations that can express
the optimal policy or value function. Moreover, such works typically consider
tasks with small, abstract discrete action spaces and thus overlook the
importance of action representation learning in continuous control. In this
paper, we introduce
TACO: Temporal Action-driven Contrastive Learning, a simple yet powerful
temporal contrastive learning approach that facilitates the concurrent
acquisition of latent state and action representations for agents. TACO
simultaneously learns a state and an action representation by optimizing the
mutual information between representations of current states paired with action
sequences and representations of the corresponding future states.
Theoretically, TACO can be shown to learn state and action representations that
encompass sufficient information for control, thereby improving sample
efficiency. For online RL, TACO achieves a 40% performance boost after one
million environment interaction steps on average across nine challenging visual
continuous control tasks from the DeepMind Control Suite. In addition, we show
that TACO can also serve as a plug-and-play module that can be added to existing
offline visual RL methods, establishing new state-of-the-art performance for
offline visual RL across offline datasets of varying quality.
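
The temporal contrastive objective described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the mutual information is estimated, so the InfoNCE loss used here (a common lower-bound surrogate) is an assumption. The function names `info_nce_loss` and `taco_style_loss`, the concatenate-and-project fusion of state and action-sequence embeddings, the cosine similarity, and the temperature value are also hypothetical choices made for illustration. Other future states in the batch act as negatives.

```python
import numpy as np


def info_nce_loss(query, key, temperature=0.1):
    """InfoNCE loss where row i of `query` should match row i of `key`.

    All other rows of `key` in the batch are treated as negatives.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    key = key / np.linalg.norm(key, axis=1, keepdims=True)
    logits = query @ key.T / temperature                  # shape (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


def taco_style_loss(state_z, action_seq_u, future_z, proj):
    """Contrast (current state, action sequence) against the future state.

    state_z:      (B, Dz)      latent of the current state
    action_seq_u: (B, K, Du)   latents of the K actions that follow
    future_z:     (B, Dz)      latent of the state K steps later
    proj:         (Dz + K*Du, Dz) hypothetical linear fusion layer
    """
    batch = state_z.shape[0]
    # Assumed fusion: flatten the action sequence and concatenate with the state.
    joint = np.concatenate([state_z, action_seq_u.reshape(batch, -1)], axis=1)
    query = joint @ proj
    return info_nce_loss(query, future_z)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, K, Dz, Du = 8, 3, 16, 4
    loss = taco_style_loss(
        rng.normal(size=(B, Dz)),
        rng.normal(size=(B, K, Du)),
        rng.normal(size=(B, Dz)),
        rng.normal(size=(Dz + K * Du, Dz)),
    )
    print("contrastive loss on random inputs:", loss)
```

In a real agent the random arrays would be replaced by the outputs of learned state and action encoders, and the loss would be minimized jointly with the RL objective; those details are not given in the abstract.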

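This paper introduces TACO, a temporal action-driven contrastive learning method. It learns state and action representations simultaneously by optimizing the mutual information between representations of current states paired with action sequences and representations of the corresponding future states, and it improves performance in several deep reinforcement learning settings.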

TACO: Temporal Latent Action-Driven Contrastive Loss for Visual Reinforcement Learning

Prior works on action representation learning mainly focus on designing
various architectures to extract the global representations for short video
clips. In contrast, many practical applications such as video alignment have a
strong demand for learning dense representations for long videos. In this
paper, we introduce a novel contrastive action representation learning (CARL)
framework to learn frame-wise action representations, especially for long
videos, in a self-supervised manner. Concretely, we introduce a simple yet
efficient video encoder that considers spatio-temporal context to extract
frame-wise representations. Inspired by the recent progress of self-supervised
learning, we present a novel sequence contrastive loss (SCL) applied on two
correlated views obtained through a series of spatio-temporal data
augmentations. SCL optimizes the embedding space by minimizing the
KL-divergence between the sequence similarity of two augmented views and a
prior Gaussian distribution of timestamp distance. Experiments on the FineGym,
PennAction, and Pouring datasets show that our method outperforms the previous
state of the art by a large margin for downstream fine-grained action
classification. Surprisingly, even without training on paired videos, our
approach also shows outstanding performance on video alignment and fine-grained
frame retrieval tasks. Code and models are available at
this https URL
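
The sequence contrastive loss described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the direction of the KL-divergence, the similarity measure, or any hyperparameters, so the choice of KL(prior || predicted), the cosine similarity, `sigma`, `temperature`, and the function name `sequence_contrastive_loss` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sequence_contrastive_loss(emb_a, emb_b, time_a, time_b,
                              sigma=1.0, temperature=0.1):
    """KL between a Gaussian timestamp prior and frame-to-frame similarity.

    emb_a:  (Ta, D) frame embeddings of the first augmented view
    emb_b:  (Tb, D) frame embeddings of the second augmented view
    time_a: (Ta,)   timestamps of the frames in the first view
    time_b: (Tb,)   timestamps of the frames in the second view
    """
    # Cosine similarity between every frame of view A and every frame of view B.
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    pred = softmax(emb_a @ emb_b.T / temperature, axis=1)      # (Ta, Tb)

    # Gaussian prior: frames that are close in time should be similar.
    dist_sq = (time_a[:, None] - time_b[None, :]) ** 2
    prior = softmax(-dist_sq / (2.0 * sigma ** 2), axis=1)     # (Ta, Tb)

    # Assumed direction: KL(prior || pred), averaged over frames of view A.
    eps = 1e-12
    kl = np.sum(prior * (np.log(prior + eps) - np.log(pred + eps)), axis=1)
    return float(kl.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(8, 16))
    stamps = np.arange(8, dtype=float)
    print("loss with matching timestamps:",
          sequence_contrastive_loss(frames, frames, stamps, stamps))
```

With identical embeddings in both views, the loss is lower when the two timestamp sequences agree than when one is reversed, which is the behavior the Gaussian prior is meant to encourage.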

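This paper proposes a novel contrastive action representation learning (CARL) framework that learns frame-wise action representations in a self-supervised manner, especially for long videos. The framework comprises a simple yet efficient video encoder and a novel sequence contrastive loss (SCL) applied to two correlated views obtained through a series of spatio-temporal data augmentations. Experiments on the FineGym, PennAction, and Pouring datasets show that the method clearly outperforms prior work on downstream fine-grained action classification. Surprisingly, even without training on paired videos, the method also performs strongly on video alignment and fine-grained frame retrieval.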