Accurately learning task-relevant state representations from
high-dimensional observations with visual distractions is a realistic and
challenging problem in visual reinforcement learning. Recently, unsupervised
representation learning methods based on bisimulation metrics, contrastive
learning, prediction, and reconstruction have shown the ability to extract
task-relevant information. However, prediction-, contrast-, and
reconstruction-based approaches lack appropriate mechanisms for extracting task
information, and bisimulation-based methods are limited in domains with sparse
rewards, so it remains difficult to extend these methods effectively to
environments with distractions. To alleviate these problems, this paper
incorporates action sequences, which contain task-intensive signals, into
representation learning.
Specifically, we propose a Sequential Action-induced invariant Representation
(SAR) method, in which the encoder is optimized by an auxiliary learner to
preserve only the components that follow the control signals of sequential
actions, so that the agent is induced to learn a representation that is robust to
distractions. We conduct extensive experiments on DeepMind Control suite tasks
with distractions, where SAR achieves the best performance against strong
baselines. We also demonstrate the effectiveness of our method at disregarding
task-irrelevant information by deploying SAR to real-world CARLA-based
autonomous driving with natural distractions. Finally, we analyze
generalization using generalization decay and t-SNE visualization. Code and
demo videos are available at
this https URL

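To address the problem of accurately learning task-relevant state representations from high-dimensional observations with visual distractions, this paper proposes the Sequential Action-induced invariant Representation (SAR) method, a representation learning approach that is robust to distractions. The encoder is optimized to preserve only the components that follow the control signals of sequential actions, which lets the agent learn robust representations. Experiments demonstrate the method's effectiveness on tasks with distractions and in a real-world autonomous driving scenario.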