A central challenge in multimodal Reinforcement Learning (RL) is the effective integration of different observation modalities. Accurate representations derived from these modalities are key to improving the robustness and sample efficiency of RL algorithms. However, learning representations from visuotactile data in RL settings is difficult, chiefly because the data are high-dimensional and because visual and tactile inputs must be correlated with both the dynamic environment and the task objectives. To address these challenges, we propose Multimodal Contrastive Unsupervised Reinforcement Learning (M2CURL). Our approach employs a novel multimodal self-supervised learning technique that learns efficient representations and speeds up the convergence of RL algorithms. The method is agnostic to the underlying RL algorithm and can therefore be integrated with any available one. We evaluate M2CURL on the Tactile Gym 2 simulator and show that it significantly improves learning efficiency across different manipulation tasks, as evidenced by faster convergence and higher cumulative rewards per episode compared to standard RL algorithms without our representation learning approach.
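To make the idea of contrastive learning across modalities concrete, the following is a minimal sketch of a generic cross-modal InfoNCE loss. The abstract does not specify M2CURL's objective, so this illustrates the general technique rather than the authors' implementation. It assumes each visual embedding is paired with the tactile embedding at the same index; the function names, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(visual, tactile, temperature=0.1):
    """Generic cross-modal InfoNCE loss (illustrative, not the paper's code).

    Assumption: visual[i] and tactile[i] come from the same time step and
    form the positive pair; every other tactile[j] acts as a negative.
    """
    visual = [normalize(v) for v in visual]
    tactile = [normalize(t) for t in tactile]
    total = 0.0
    for i, v in enumerate(visual):
        logits = [dot(v, t) / temperature for t in tactile]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target.
        total += log_sum - logits[i]
    return total / len(visual)


# Toy embeddings (hypothetical): matched pairs vs. deliberately mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is lower when the paired embeddings agree than when they are swapped, which is the signal a contrastive objective uses to pull matching visual and tactile representations together. In practice the embeddings would come from learned encoders, and the loss would be minimized jointly with, or alongside, the RL objective.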

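This work proposes M2CURL, a multimodal contrastive unsupervised reinforcement learning method that effectively integrates different observation modalities and, by learning efficient representations, improves the robustness and sample efficiency of RL algorithms. The method is validated in a tactile simulation environment, where it markedly improves learning efficiency over standard RL algorithms, as shown by faster convergence and higher cumulative rewards.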

M2CURL: Efficient Multimodal Reinforcement Learning via Self-Supervised Representation Learning for Robotic Manipulation