Despite the success of CLIP-based training recipes in vision-language models,
their scalability to more modalities (e.g., 3D, audio) is limited by the need
for large-scale data, which is expensive or even infeasible to collect for rare
modalities.
In this paper, we present ViT-Lens that facilitates efficient omni-modal
representation learning by perceiving novel modalities with a pretrained ViT
and aligning to a pre-defined space. Specifically, a modality-specific lens
is tuned to project signals from each modality into a shared embedding space,
where they are then processed by a strong ViT that carries pre-trained image
knowledge. The
encoded multimodal representations are optimized toward aligning with the
modal-independent space, pre-defined by off-the-shelf foundation models. A
well-trained lens with a ViT backbone has the potential to serve as one of
these foundation models, supervising the learning of subsequent modalities.
ViT-Lens provides a unified solution for representation learning across a
growing number of modalities, with two appealing benefits: (i) it exploits the
pretrained ViT effectively across tasks and domains in a data-efficient regime;
(ii) novel modalities gain emergent downstream capabilities thanks to the
modality alignment space. We evaluate ViT-Lens in the context of 3D as an
initial verification. In zero-shot 3D classification, ViT-Lens achieves
substantial improvements over the previous state of the art, with 52.0% accuracy
on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore,
we enable zero-shot 3D question-answering by simply integrating the trained 3D
lens into the InstructBLIP model without any adaptation. We will release the
results of ViT-Lens on more modalities in the near future.
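
The alignment step described above can be illustrated with a small sketch. The
abstract does not state the training objective, so the symmetric contrastive
(InfoNCE-style) loss below is an assumption borrowed from CLIP-style training,
and the names `alignment_loss`, `cosine_logits`, and the temperature value are
hypothetical. Plain lists stand in for the lens-plus-ViT outputs and for the
anchor embeddings of a foundation model.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(queries, anchors, temperature=0.07):
    # Pairwise cosine similarities, divided by a temperature (assumed value).
    q = [l2_normalize(v) for v in queries]
    a = [l2_normalize(v) for v in anchors]
    return [[sum(x * y for x, y in zip(qi, aj)) / temperature for aj in a]
            for qi in q]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where row i's correct class is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def alignment_loss(new_modality_emb, anchor_emb, temperature=0.07):
    # Symmetric loss: new modality -> anchor and anchor -> new modality.
    logits = cosine_logits(new_modality_emb, anchor_emb, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

# Toy data: anchors stand in for embeddings from a frozen foundation model.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # each sample close to its own anchor
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # each sample close to the wrong anchor

loss_aligned = alignment_loss(aligned, anchors)
loss_shuffled = alignment_loss(shuffled, anchors)
```

In this toy setup the loss is lower when each new-modality embedding sits near
its paired anchor, which is the behavior an alignment objective of this kind is
meant to reward. In practice only the lens parameters would be updated while
the anchor embeddings stay fixed.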

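This paper introduces ViT-Lens, a method that achieves efficient omni-modal representation learning by using a pretrained ViT to perceive novel modalities and aligning them to a pre-defined space. In a validation that uses 3D as the example modality, ViT-Lens achieves substantial improvements in zero-shot 3D classification and also enables zero-shot 3D question answering by integrating the trained 3D lens into the InstructBLIP model.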

ViT-Lens: Towards Omni-modal Representations

In recent years, 3D representation learning has turned to 2D vision-language
pre-trained models to overcome data scarcity challenges. However, existing
methods simply transfer 2D alignment strategies, aligning 3D representations
with single-view 2D images and coarse-grained parent category text. These
approaches introduce information degradation and insufficient synergy issues,
leading to performance loss. Information degradation arises from overlooking
the fact that a 3D representation should be equivalent to a series of
multi-view images and more fine-grained subcategory text. Insufficient synergy
arises from neglecting the idea that a robust 3D representation should align
with the joint vision-language space, rather than independently aligning with
each modality.
In this paper, we propose a multi-view joint modality modeling approach, termed
JM3D, to obtain a unified representation for point cloud, text, and image.
Specifically, a novel Structured Multimodal Organizer (SMO) is proposed to
address the information degradation issue, which introduces contiguous
multi-view images and hierarchical text to enrich the representation of vision
and language modalities. A Joint Multi-modal Alignment (JMA) is designed to
tackle the insufficient synergy problem, which models the joint modality by
incorporating language knowledge into the visual modality. Extensive
experiments on ModelNet40 and ScanObjectNN demonstrate the effectiveness of our
proposed method, JM3D, which achieves state-of-the-art performance in zero-shot
3D classification. JM3D outperforms ULIP by approximately 4.3% on PointMLP and
by up to 6.5% on PointNet++ in top-1 accuracy for zero-shot 3D classification
on ModelNet40. The source code and trained
models for all our experiments are publicly available at
this https URL
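
The zero-shot 3D classification protocol referred to above can be sketched as
follows. This is a generic illustration of how top-1 accuracy is commonly
computed in a shared embedding space, not JM3D's actual evaluation code: the
function names, the toy embeddings, and the category labels are all
hypothetical, and real embeddings would come from trained point-cloud and text
encoders.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(shape_emb, text_embs):
    # Pick the category whose text embedding is most similar to the shape.
    return max(text_embs, key=lambda label: cosine(shape_emb, text_embs[label]))

def top1_accuracy(shape_embs, labels, text_embs):
    # Fraction of shapes whose nearest category text matches the true label.
    correct = sum(zero_shot_classify(e, text_embs) == y
                  for e, y in zip(shape_embs, labels))
    return correct / len(labels)

# Toy category text embeddings (hypothetical values).
text_embs = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
# Toy point-cloud embeddings; the third one lies nearer to "chair" than to
# its true label "table", so it counts as an error.
shape_embs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.5, 0.4, 0.1]]
labels = ["chair", "table", "table"]

accuracy = top1_accuracy(shape_embs, labels, text_embs)
```

No category-specific training is involved in this procedure: classification
relies entirely on how well the 3D embeddings align with the text embeddings,
which is why alignment quality shows up directly in the reported accuracy.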

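By introducing a multi-view joint modality modeling approach, this paper proposes a new method called JM3D to address the information degradation and insufficient synergy problems in 3D representation learning, and it outperforms existing methods on zero-shot 3D classification.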