Aiming to advance AI agents, large foundation models significantly improve
reasoning and instruction execution, yet the current focus on vision and
language neglects the potential of perceiving diverse modalities in open-world
environments. However, the success of data-driven vision and language models is
costly or even infeasible to be reproduced for rare modalities. In this paper,
we present ViT-Lens-2 that facilitates efficient omni-modal representation
learning by perceiving novel modalities with a pretrained ViT and aligning them
to a pre-defined space. Specifically, the modality-specific lens is tuned to
project any-modal signals to an intermediate embedding space, which are then
processed by a strong ViT with pre-trained visual knowledge. The encoded
representations are optimized toward aligning with the modal-independent space,
pre-defined by off-the-shelf foundation models. ViT-Lens-2 provides a unified
solution for representation learning of increasing modalities with two
appealing advantages: (i) Unlocking the great potential of pretrained ViTs to
novel modalities effectively with efficient data regime; (ii) Enabling emergent
downstream capabilities through modality alignment and shared ViT parameters.
We tailor ViT-Lens-2 to learn representations for 3D point cloud, depth, audio,
tactile and EEG, and set new state-of-the-art results across various
understanding tasks, such as zero-shot classification. By seamlessly
integrating ViT-Lens-2 into Multimodal Foundation Models, we enable
Any-modality to Text and Image Generation in a zero-shot manner. Code and
models are available at this https URL

通过使用预训练的 ViT 和对齐模态，ViT-Lens-2 提供了一种有效的方法来探索新颖模态的各种新颖任务，并在各种理解任务中取得了新的最佳结果，包括零样本分类。

ViT-Lens-2: 通往全模态智能的入口

ViT-Lens-2: Gateway to Omni-modal Intelligence

Though the success of CLIP-based training recipes in vision-language models,
their scalability to more modalities (e.g., 3D, audio, etc.) is limited to
large-scale data, which is expensive or even inapplicable for rare modalities.
In this paper, we present ViT-Lens that facilitates efficient omni-modal
representation learning by perceiving novel modalities with a pretrained ViT
and aligning to a pre-defined space. Specifically, the modality-specific lens
is tuned to project multimodal signals to the shared embedding space, which are
then processed by a strong ViT that carries pre-trained image knowledge. The
encoded multimodal representations are optimized toward aligning with the
modal-independent space, pre-defined by off-the-shelf foundation models. A
well-trained lens with a ViT backbone has the potential to serve as one of
these foundation models, supervising the learning of subsequent modalities.
ViT-Lens provides a unified solution for representation learning of increasing
modalities with two appealing benefits: (i) Exploiting the pretrained ViT
across tasks and domains effectively with efficient data regime; (ii) Emergent
downstream capabilities of novel modalities are demonstrated due to the
modality alignment space. We evaluate ViT-Lens in the context of 3D as an
initial verification. In zero-shot 3D classification, ViT-Lens achieves
substantial improvements over previous state-of-the-art, showing 52.0% accuracy
on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore,
we enable zero-shot 3D question-answering by simply integrating the trained 3D
lens into the InstructBLIP model without any adaptation. We will release the
results of ViT-Lens on more modalities in the near future.

本文介绍了一种名为 ViT-Lens 的方法，通过使用预训练的 ViT 模型感知新颖形式的多模态数据，并与预定义空间进行对齐，从而实现高效的全模态表示学习。在以 3D 为例的验证中，ViT-Lens 在零样本 3D 分类任务中取得了显著的改进，同时还成功将训练好的 3D lens 集成到 InstructBLIP 模型中实现了零样本 3D 问答。