In this work, we introduce the Context-Aware MultiModal Learner (CaMML) for tuning large multimodal models (LMMs). CaMML is a lightweight module crafted to seamlessly integrate multimodal contextual samples into large models, thereby empowering the model to derive knowledge from analogous, domain-specific, up-to-date information and make grounded inferences. Importantly, CaMML is highly scalable and can efficiently handle lengthy multimodal context examples owing to its hierarchical design. Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that show exceptional performance across an array of benchmark datasets for multimodal tasks. Remarkably, CaMML-13B achieves state-of-the-art performance on over ten widely recognized multimodal benchmark datasets, surpassing LLaVA-1.5 (13B) by a noticeable margin, without integrating any external resources. Moreover, we have conducted extensive ablation studies to inspect the inner workings of CaMML and performed qualitative analyses to showcase its effectiveness in handling challenging real-world cases.

CaMML: Context-Aware Multimodal Learner for Large Models