The increasing availability of multi-sensor data sparks interest in
multimodal self-supervised learning. However, most existing approaches learn
only common representations across modalities while ignoring intra-modal
training and modality-unique representations. We propose Decoupling Common and
Unique Representations (DeCUR), a simple yet effective method for multimodal
self-supervised learning. By distinguishing inter- and intra-modal embeddings,
DeCUR is trained to integrate complementary information across different
modalities. We evaluate DeCUR in three common multimodal scenarios
(radar-optical, RGB-elevation, and RGB-depth), and demonstrate its consistent
benefits on scene classification and semantic segmentation downstream tasks.
Notably, we obtain straightforward improvements by transferring our pretrained
backbones to state-of-the-art supervised multimodal methods without any
hyperparameter tuning. Furthermore, we conduct a comprehensive explainability
analysis to shed light on the interpretation of common and unique features in
our multimodal approach. Code is available at
https://github.com/zhu-xlab/DeCUR.
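
The distinction between common and unique embedding dimensions can be illustrated with a small sketch. It assumes, which the abstract does not state, a cross-correlation (redundancy-reduction) objective in which the first `n_common` dimensions are pushed to correlate across the two modalities and the remaining dimensions are pushed to decorrelate. It covers only a cross-modal term, not intra-modal training. All names (`standardize`, `cross_correlation`, `decoupled_loss`, `n_common`, `lam`) are hypothetical and are not taken from the DeCUR code base.

```python
import math


def standardize(batch):
    """Normalize each embedding dimension to zero mean and unit variance."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        # Fall back to 1.0 so a constant dimension does not divide by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / std
    return out


def cross_correlation(za, zb):
    """d x d cross-correlation matrix between two batches of embeddings."""
    za, zb = standardize(za), standardize(zb)
    n, d = len(za), len(za[0])
    return [[sum(za[k][i] * zb[k][j] for k in range(n)) / n for j in range(d)]
            for i in range(d)]


def decoupled_loss(za, zb, n_common, lam=0.005):
    """Assumed cross-modal loss: align common dims, decorrelate unique dims."""
    c = cross_correlation(za, zb)
    d = len(c)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                # Common dimensions target correlation 1, unique ones target 0.
                target = 1.0 if i < n_common else 0.0
                loss += (c[i][j] - target) ** 2
            elif (i < n_common) == (j < n_common):
                # Off-diagonal terms are penalized only within the same block.
                loss += lam * c[i][j] ** 2
    return loss
```

For example, `decoupled_loss(za, zb, n_common=64)` on two batches of 128-dimensional embeddings would treat the first 64 dimensions as common and the last 64 as unique. The split point is a design choice that this sketch leaves to the caller.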

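Using multimodal self-supervised learning on multi-sensor data, this paper proposes a method that distinguishes common from unique representations (Decoupling Common and Unique Representations, DeCUR) and shows consistent benefits on the downstream tasks of scene classification and semantic segmentation.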

DeCUR: Decoupling Common and Unique Representations in Multimodal Self-Supervision

DeCUR: decoupling common & unique representations for multimodal self-supervision

Human Activity Recognition is a field of research where input data can take
many forms. Each of the possible input modalities describes human behaviour in
a different way, and each has its own strengths and weaknesses. We explore the
hypothesis that leveraging multiple modalities can lead to better recognition.
Since manual annotation of input data is expensive and time-consuming, we
focus on self-supervised methods, which can learn useful feature
representations without any ground truth labels. We extend a number of recent
contrastive self-supervised approaches for the task of Human Activity
Recognition, leveraging inertial and skeleton data. Furthermore, we propose a
flexible, general-purpose framework for performing multimodal self-supervised
learning, named Contrastive Multiview Coding with Cross-Modal Knowledge Mining
(CMC-CMKM). This framework exploits modality-specific knowledge in order to
mitigate the limitations of typical self-supervised frameworks. Extensive
experiments on two widely used datasets demonstrate that the proposed
framework significantly outperforms contrastive unimodal and multimodal
baselines in different scenarios, including fully-supervised fine-tuning,
activity retrieval and semi-supervised learning. Furthermore, it performs
competitively even when compared to supervised methods.
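
To make the contrastive setting concrete, the following sketch computes a symmetric cross-modal InfoNCE loss between paired inertial and skeleton embeddings. It illustrates the generic contrastive objective that such frameworks build on. It does not implement the cross-modal knowledge mining step of CMC-CMKM, whose details the abstract does not give. The function names and the temperature value are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of matching anchor i to candidate i among all candidates."""
    anchors = [l2_normalize(a) for a in anchors]
    candidates = [l2_normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a, c)) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def cross_modal_loss(inertial, skeleton, temperature=0.1):
    """Average the inertial-to-skeleton and skeleton-to-inertial losses."""
    return 0.5 * (info_nce(inertial, skeleton, temperature)
                  + info_nce(skeleton, inertial, temperature))
```

Here the two views of one recording (its inertial and skeleton embeddings) form the positive pair, and every other recording in the batch serves as a negative. The loss is small when the paired embeddings point in the same direction and large when they are mismatched.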

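This paper proposes a multimodal self-supervised learning framework named CMC-CMKM that learns better features for human activity recognition. Extensive experiments on two widely used datasets show that the framework significantly outperforms contrastive unimodal and multimodal baselines across different scenarios and, in some cases, is even competitive with supervised methods.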

Multimodal Human Activity Recognition Using Contrastive Learning with Cross-Modal Knowledge Mining

Contrastive Learning with Cross-Modal Knowledge Mining for Multimodal Human Activity Recognition

Multimodal self-supervised learning is attracting growing attention because it
makes it possible not only to train large networks without human supervision but
also to search and retrieve data across various modalities. In this context, this paper
proposes a self-supervised training framework that learns a common multimodal
embedding space that, in addition to sharing representations across different
modalities, enforces a grouping of semantically similar instances. To this end,
we extend the concept of instance-level contrastive learning with a multimodal
clustering step in the training pipeline to capture semantic similarities
across modalities. The resulting embedding space enables retrieval of samples
across all modalities, even from unseen datasets and different domains. To
evaluate our approach, we train our model on the HowTo100M dataset and evaluate
its zero-shot retrieval capabilities in two challenging domains, namely
text-to-video retrieval and temporal action localization, showing
state-of-the-art results on four different datasets.
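
The multimodal clustering step can be pictured with a minimal sketch: per-instance embeddings from several modalities are fused, and the fused vectors are grouped with plain k-means so that semantically similar instances share a cluster. The fusion by averaging, the deterministic initialization, and all function names are assumptions made for illustration. How the clusters feed back into the training loss is not described in the abstract and is left out here.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuse_modalities(per_modality):
    """Average each instance's embeddings across modalities (assumed fusion)."""
    n = len(per_modality[0])
    return [mean_vector([modality[i] for modality in per_modality]) for i in range(n)]


def kmeans(points, k, iterations=10):
    """Plain k-means, initialized deterministically from the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                       for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return centroids, assignments
```

A call such as `kmeans(fuse_modalities([video_emb, text_emb]), k=2)` returns the centroids and one cluster index per instance. Because the initialization simply takes the first `k` points, the result depends on input order. A practical implementation would likely use a more robust seeding scheme.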

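This paper proposes a self-supervised training framework that adds a multimodal clustering step to the training pipeline to capture semantic similarities across modalities, thereby learning a common multimodal embedding space. It demonstrates state-of-the-art results on four different datasets in two challenging domains, text-to-video retrieval and temporal action localization.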