Large Language Models (LLMs), benefiting from the auto-regressive modelling
approach performed on massive unannotated texts corpora, demonstrates powerful
perceptual and reasoning capabilities. However, as for extending
auto-regressive modelling to multi-modal scenarios to build Large Multi-modal
Models (LMMs), there lies a great difficulty that the image information is
processed in the LMM as continuous visual embeddings, which cannot obtain
discrete supervised labels for classification. In this paper, we successfully
perform multi-modal auto-regressive modeling with a unified objective for the
first time. Specifically, we propose the concept of visual words, which maps
the visual features to probability distributions over LLM's vocabulary,
providing supervision information for visual modelling. We further explore the
distribution of visual features in the semantic space within LMM and the
possibility of using text embeddings to represent visual information.
Experimental results and ablation studies on 5 VQA tasks and 4 benchmark
toolkits validate the powerful performance of our proposed approach.

成功进行多模态自回归建模，并首次提出了视觉词概念，将视觉特征映射到 LLMs 词汇的概率分布，为视觉建模提供了监督信息。通过对 5 个 VQA 任务和 4 个基准工具包的实验结果和消融研究的验证，证明了我们提出方法的强大性能。

多模态自回归建模基于视觉单词

Multi-modal Auto-regressive Modeling via Visual Words

In real-world scenarios, achieving domain generalization (DG) presents
significant challenges as models are required to generalize to unknown target
distributions. Generalizing to unseen multi-modal distributions poses even
greater difficulties due to the distinct properties exhibited by different
modalities. To overcome the challenges of achieving domain generalization in
multi-modal scenarios, we propose SimMMDG, a simple yet effective multi-modal
DG framework. We argue that mapping features from different modalities into the
same embedding space impedes model generalization. To address this, we propose
splitting the features within each modality into modality-specific and
modality-shared components. We employ supervised contrastive learning on the
modality-shared features to ensure they possess joint properties and impose
distance constraints on modality-specific features to promote diversity. In
addition, we introduce a cross-modal translation module to regularize the
learned features, which can also be used for missing-modality generalization.
We demonstrate that our framework is theoretically well-supported and achieves
strong performance in multi-modal DG on the EPIC-Kitchens dataset and the novel
Human-Animal-Cartoon (HAC) dataset introduced in this paper. Our source code
and HAC dataset are available at this https URL

通过在多模态情境中将特征分为模态特定和模态共享组件，并运用监督对比学习对模态共享特征施加距离约束，以促进多样性，并引入跨模态转换模块来规范学习特征，以达到领域泛化的目标。