We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a
vision-centric approach. While stronger language models can enhance multimodal
capabilities, the design choices for vision components are often insufficiently
explored and disconnected from visual representation learning research. This
gap hinders accurate sensory grounding in real-world scenarios. Our study uses
LLMs and visual instruction tuning as an interface to evaluate various visual
representations, offering new insights into different models and architectures
-- self-supervised, strongly supervised, or combinations thereof -- based on
experiments with over 20 vision encoders. We critically examine existing MLLM
benchmarks, addressing the difficulties involved in consolidating and
interpreting results from various tasks, and introduce a new vision-centric
benchmark, CV-Bench. To further improve visual grounding, we propose the
Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that
integrates high-resolution vision features with LLMs while reducing the number
of tokens. Additionally, we discuss the curation of high-quality visual
instruction-tuning data from publicly available sources, emphasizing the
importance of data source balancing and distribution ratio. Collectively,
Cambrian-1 not only achieves state-of-the-art performance but also serves as a
comprehensive, open cookbook for instruction-tuned MLLMs. We provide model
weights, code, supporting tools, datasets, and detailed instruction-tuning and
evaluation recipes. We hope our release will inspire and accelerate
advancements in multimodal systems and visual representation learning.
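
As a rough illustration of what a spatially-aware, token-reducing connector can look like, the sketch below lets each query in a small G x G grid attend only to its own window of a larger H x W feature map. It is a minimal sketch under stated assumptions, not the SVA implementation from the release: the function name `aggregate`, the fixed (non-learned) queries, the single feature map, and the absence of key/value projections are all simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aggregate(features, queries):
    """Reduce an H x W grid of D-dim features to a G x G grid of tokens.

    Hypothetical helper: each query attends only to the (H/G) x (W/G)
    window of features at its own grid position, so the output keeps the
    spatial layout while using G*G tokens instead of H*W.
    """
    H, W = len(features), len(features[0])
    G = len(queries)
    assert H % G == 0 and W % G == 0, "grid sizes must divide evenly"
    wh, ww = H // G, W // G
    D = len(queries[0][0])
    out = []
    for gi in range(G):
        row = []
        for gj in range(G):
            q = queries[gi][gj]
            # Features in the window that belongs to this query.
            window = [features[i][j]
                      for i in range(gi * wh, (gi + 1) * wh)
                      for j in range(gj * ww, (gj + 1) * ww)]
            # Scaled dot-product attention; features act as keys and values.
            weights = softmax([dot(q, k) / math.sqrt(D) for k in window])
            row.append([sum(w * v[d] for w, v in zip(weights, window))
                        for d in range(D)])
        out.append(row)
    return out


# Toy input: a 4 x 4 feature map (16 tokens) reduced to 2 x 2 (4 tokens).
features = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
queries = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
tokens = aggregate(features, queries)
print(len(tokens) * len(tokens[0]))  # number of output tokens
```

With all-zero queries the attention weights are uniform, so each output token is simply the mean of its window; in a trained connector the queries would be learned parameters.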

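We introduce Cambrian-1, a family of vision-centric multimodal LLMs (MLLMs). Using visual instruction tuning as an interface, we evaluate various visual representations and propose the Spatial Vision Aggregator (SVA) to further improve visual grounding while reducing the number of tokens. In addition, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, and we provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advances in multimodal systems and visual representation learning.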

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

In this article, we present our approach to single-modality vision
representation learning. Understanding vision representations of product
content is vital for recommendations, search, and advertising applications in
e-commerce. We detail and contrast techniques used to fine-tune large-scale
vision representation learning models efficiently under low-resource
settings, including several pretrained backbone architectures from both the
convolutional neural network and the vision transformer families. We
describe the challenges that e-commerce applications face at scale and the
efforts to train, evaluate, and serve visual representations more efficiently.
We present ablation studies for several downstream tasks, including our
visually similar ad recommendations. We evaluate the offline performance of the
derived visual representations in downstream tasks. To this end, we present a
novel text-to-image generative offline evaluation method for visually similar
recommendation systems. Finally, we include online results from deployed
machine learning systems in production at Etsy.
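
To make the idea of visually similar recommendations concrete, the sketch below ranks catalog items by the cosine similarity of their image embeddings to a query embedding. It is an illustrative assumption, not the system deployed at Etsy: the function `most_similar`, the item ids, and the two-dimensional embeddings are made up for the example, and a production system would typically use an approximate nearest-neighbor index rather than a full sort.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def most_similar(query, catalog, k=2):
    """Return the ids of the k catalog items closest to the query.

    catalog maps a (hypothetical) item id to its visual embedding.
    """
    ranked = sorted(catalog.items(),
                    key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [item_id for item_id, _ in ranked[:k]]


# Toy embeddings; real ones would come from a fine-tuned vision backbone.
catalog = {
    "mug_a": [1.0, 0.0],
    "mug_b": [0.9, 0.1],
    "scarf": [0.0, 1.0],
}
print(most_similar([1.0, 0.0], catalog, k=2))  # ['mug_a', 'mug_b']
```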

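This article presents an approach to single-modality vision representation learning, aimed mainly at product recommendation, search, and advertising applications in e-commerce, and covers pretrained backbone architectures from the convolutional neural network and vision transformer families. We evaluate and analyze the approach both offline and online, propose a novel text-to-image generative offline evaluation method for visually similar recommendation systems, and report results from machine learning systems deployed in production at Etsy.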