In the realm of Multimodal Large Language Models (MLLMs), the vision-language
connector plays a crucial role in linking pre-trained vision encoders with
Large Language Models (LLMs). Despite its importance, the vision-language
connector has been relatively underexplored. In this study, we propose a
strong vision-language connector that enables MLLMs to achieve high accuracy
while maintaining low computational cost. We first reveal the existence of
visual anchors in the Vision Transformer and propose a cost-effective search
algorithm to extract them. Building on these findings, we introduce the Anchor
Former (AcFormer), a novel vision-language connector designed to leverage the
rich prior knowledge obtained from these visual anchors during pretraining,
guiding the aggregation of information. Through extensive experimentation, we
demonstrate that the proposed method reduces computational costs by nearly
two-thirds compared with the baseline while also outperforming baseline
methods. This highlights the effectiveness and efficiency of AcFormer.
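
The aggregation idea described above can be illustrated with a minimal sketch. It assumes the anchors are already known: the search algorithm is not described in this abstract, so `anchor_indices` is a hypothetical input. It also omits the learned projections a real connector would have. The function name `aggregate_with_anchors` is illustrative and is not taken from the paper. Each anchor token acts as a query in a plain scaled dot-product attention over all visual tokens, so the connector emits one token per anchor instead of one per patch.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_with_anchors(tokens, anchor_indices):
    """Attention pooling in which anchor tokens serve as the queries.

    tokens: list of N visual token vectors, each a list of D floats.
    anchor_indices: hypothetical positions of the anchors among the tokens
        (assumed given; the anchor search itself is not sketched here).
    Returns len(anchor_indices) vectors, each a weighted average of all tokens.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for idx in anchor_indices:
        query = tokens[idx]
        # Similarity of this anchor to every visual token (keys = tokens).
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Weighted sum of the tokens (values = tokens, no learned projection).
        pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        outputs.append(pooled)
    return outputs


# Toy usage: 6 visual tokens of dimension 2, with 2 assumed anchors.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.1, 0.9], [0.3, 0.3]]
pooled_tokens = aggregate_with_anchors(visual_tokens, anchor_indices=[0, 3])
print(len(pooled_tokens))  # 2 aggregated tokens instead of 6
```

Because the LLM then processes only as many visual tokens as there are anchors, the number of anchors directly controls the sequence length passed downstream, which is where the savings in this kind of design would come from.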

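This study proposes a strong vision-language connector that mines visual anchors and exploits the rich prior knowledge they acquire during pretraining, enabling multimodal large language models with high accuracy and low computational cost. Extensive experiments show that the method reduces computational cost by nearly two-thirds compared with baseline methods while performing better, highlighting the effectiveness and efficiency of AcFormer.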

Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model

In this work, we discuss building performant Multimodal Large Language Models
(MLLMs). In particular, we study the importance of various architecture
components and data choices. Through careful and comprehensive ablations of the
image encoder, the vision-language connector, and various pre-training data
choices, we identify several crucial design lessons. For example, we
demonstrate that, for large-scale multimodal pre-training, using a careful mix of
image-caption, interleaved image-text, and text-only data is crucial for
achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks,
compared to other published pre-training results. Further, we show that the
image encoder, together with image resolution and image token count, has a
substantial impact, while the vision-language connector design is of
comparatively negligible importance. By scaling up the presented recipe, we
build MM1, a family of multimodal models up to 30B parameters, consisting of
both dense models and mixture-of-experts (MoE) variants, that are SOTA in
pre-training metrics and achieve competitive performance after supervised
fine-tuning on a range of established multimodal benchmarks. Thanks to
large-scale pre-training, MM1 enjoys appealing properties such as enhanced
in-context learning and multi-image reasoning, enabling few-shot
chain-of-thought prompting.
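
The data-mixing idea in this abstract can be sketched as weighted sampling over the three data sources. This is only an illustration: the abstract does not state the mixing ratios, so the weights in `PLACEHOLDER_MIX` are made-up placeholders, and `sample_sources` is a hypothetical helper rather than part of any MM1 code.

```python
import random
from collections import Counter

# Placeholder weights for illustration only; the abstract gives no ratios.
PLACEHOLDER_MIX = {
    "image_caption": 0.4,
    "interleaved_image_text": 0.4,
    "text_only": 0.2,
}


def sample_sources(mix_weights, num_samples, seed=0):
    """Pick a data source for each training example according to the mix.

    mix_weights: mapping from source name to a non-negative weight.
    Returns a list of num_samples source names.
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mix_weights)
    weights = [mix_weights[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)


# Toy usage: count how often each source is drawn for 1000 examples.
counts = Counter(sample_sources(PLACEHOLDER_MIX, 1000))
print(counts)
```

Ablating the mix then amounts to changing the weights and retraining, which is the kind of comparison the abstract refers to.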

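This work discusses the key architecture components and data choices for building performant multimodal large language models. Careful and comprehensive ablations show that large-scale multimodal pre-training on image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art results across multiple benchmarks. Scaling up the presented recipe yields MM1, a family of multimodal models comprising dense and mixture-of-experts variants, which achieve state-of-the-art pre-training metrics and competitive performance on a range of established multimodal benchmarks.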