Visual representation learning has been a cornerstone in computer vision,
evolving from supervised learning with human-annotated labels to aligning
image-text pairs from the Internet. Despite recent advancements in multi-modal
large language models (MLLMs), the visual representations they rely on, such as
CLIP embeddings, often lack access to external world knowledge critical for
real-world visual reasoning. In this work, we propose Visual Table, a novel
visual representation tailored for MLLMs. It provides hierarchical text
descriptions of holistic visual scenes, consisting of a scene description and
multiple object-centric descriptions that encompass categories, attributes, and
knowledge at the instance level. We further develop a scalable generator for visual
tables and train it on small-scale annotations from GPT4V. Extensive
evaluations demonstrate that, with generated visual tables as additional visual
representations, our model consistently outperforms state-of-the-art
(SOTA) MLLMs across diverse benchmarks. When visual tables serve as standalone
visual representations, our model closely matches or even beats the SOTA MLLMs
that are built on CLIP visual embeddings. Our code is available at
this https URL
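
To make the hierarchical structure described above concrete, the following is a minimal sketch of how a visual table might be held in memory and flattened into text for a language model. The class names, field names, output format, and example values are all illustrative assumptions; the abstract specifies only that a visual table contains a scene description plus object-centric descriptions covering category, attributes, and knowledge.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectEntry:
    """One object-centric description (field names are assumed, not from the paper)."""
    category: str
    attributes: str
    knowledge: str


@dataclass
class VisualTable:
    """A scene-level description plus a list of instance-level object entries."""
    scene_description: str
    objects: List[ObjectEntry] = field(default_factory=list)

    def to_text(self) -> str:
        # Flatten the hierarchy into plain text, one line per level entry.
        # This layout is a hypothetical choice made for illustration only.
        lines = [f"Scene: {self.scene_description}"]
        for i, obj in enumerate(self.objects, start=1):
            lines.append(
                f"Object {i}: category={obj.category}; "
                f"attributes={obj.attributes}; knowledge={obj.knowledge}"
            )
        return "\n".join(lines)


# Hand-written example values, not model output.
table = VisualTable(
    scene_description="A kitchen counter with breakfast items.",
    objects=[
        ObjectEntry("toaster", "silver, two slots", "heats sliced bread"),
        ObjectEntry("mug", "white, ceramic", "holds hot drinks"),
    ],
)
text = table.to_text()
print(text)
```

The flattened text could then be supplied to a language model alongside, or in place of, image embeddings, which mirrors the two usage modes the abstract evaluates.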

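This work proposes Visual Table, a new visual representation for multi-modal large language models. It provides hierarchical text descriptions of visual scenes, comprising a scene description and multiple object-centric descriptions that cover categories, attributes, and instance-level knowledge. With generated visual tables as additional visual representations, our model consistently outperforms existing state-of-the-art multi-modal large language models across multiple benchmarks. When visual tables serve as standalone visual representations, our model matches or even surpasses state-of-the-art multi-modal large language models built on CLIP visual embeddings.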