Large Language Models (LLMs), benefiting from auto-regressive modeling
performed on massive unannotated text corpora, demonstrate powerful
perceptual and reasoning capabilities. However, extending auto-regressive
modeling to multi-modal scenarios to build Large Multi-modal Models (LMMs)
faces a major difficulty: image information is processed in the LMM as
continuous visual embeddings, for which no discrete supervised labels are
available for classification. In this paper, we successfully perform
multi-modal auto-regressive modeling with a unified objective for the first
time. Specifically, we propose the concept of visual words, which map visual
features to probability distributions over the LLM's vocabulary, providing
supervision information for visual modeling. We further explore the
distribution of visual features in the semantic space within the LMM and the
possibility of using text embeddings to represent visual information.
Experimental results and ablation studies on 5 VQA tasks and 4 benchmark
toolkits validate the strong performance of our proposed approach.
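
The idea of visual words can be illustrated with a minimal Python sketch. The abstract does not specify how the mapping is computed, so everything here is an assumption made for illustration: the dot-product scoring against vocabulary embeddings, the softmax, the cross-entropy loss, and all names and toy values (`visual_words`, `vocab_embeddings`, `patch_feature`) are hypothetical and are not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def visual_words(visual_feature, vocab_embeddings):
    """Map one visual feature to a probability distribution over the vocabulary.

    Assumption: similarity is a plain dot product with each token embedding.
    """
    logits = [sum(f * w for f, w in zip(visual_feature, emb))
              for emb in vocab_embeddings]
    return softmax(logits)

def cross_entropy(predicted, target):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(t * math.log(p) for p, t in zip(predicted, target) if t > 0)

# Toy vocabulary with made-up 2-d embeddings (hypothetical values).
vocab = ["cat", "dog", "car"]
vocab_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

# A made-up feature for one image patch.
patch_feature = [2.0, 0.5]

# The visual-word distribution serves as a soft label for this visual position.
target = visual_words(patch_feature, vocab_embeddings)

# Stand-in for the model's predicted distribution at the same position.
predicted = softmax([0.2, 0.1, 0.0])
loss = cross_entropy(predicted, target)
```

In this sketch the soft label plays the role that a discrete next-token label plays for text, which is how a single auto-regressive objective could cover both modalities.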

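This work successfully performs multi-modal auto-regressive modeling and, for the first time, proposes the concept of visual words, which map visual features to probability distributions over the LLM's vocabulary and provide supervision information for visual modeling. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits demonstrate the strong performance of the proposed method.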

Multi-modal Auto-regressive Modeling via Visual Words

Visual Programming (VP) has emerged as a powerful framework for Visual
Question Answering (VQA). By generating and executing bespoke code for each
question, these methods demonstrate impressive compositional and reasoning
capabilities, especially in few-shot and zero-shot scenarios. However, existing
VP methods generate all code in a single function, resulting in code that is
suboptimal in terms of both accuracy and interpretability. Inspired by human
coding practices, we propose Recursive Visual Programming (RVP), which
simplifies generated routines, provides more efficient problem solving, and can
manage more complex data structures. RVP approaches VQA tasks with an
iterative, recursive code generation strategy, allowing complicated problems
to be decomposed into smaller parts. Notably, RVP
is capable of dynamic type assignment, i.e., as the system recursively
generates a new piece of code, it autonomously determines the appropriate
return type and crafts the requisite code to generate that output. We show
RVP's efficacy through extensive experiments on benchmarks including VSR, COVR,
GQA, and NextQA, underscoring the value of adopting human-like recursive and
modular programming techniques for solving VQA tasks through coding.
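
The recursive pattern described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the RVP system: `generate_code` is a hypothetical stand-in for an LLM code generator backed by a lookup table of canned programs, the "image" is just a list of object labels, and the names `recursive_query` and `execute` are invented for this illustration.

```python
# Hypothetical canned programs standing in for LLM-generated code.
CANNED_PROGRAMS = {
    "Are there more cats than dogs?": (
        "def execute(image):\n"
        "    cats = recursive_query(image, 'How many cats are there?')\n"
        "    dogs = recursive_query(image, 'How many dogs are there?')\n"
        "    return cats > dogs\n"
    ),
    "How many cats are there?": (
        "def execute(image):\n"
        "    return image.count('cat')\n"
    ),
    "How many dogs are there?": (
        "def execute(image):\n"
        "    return image.count('dog')\n"
    ),
}

def generate_code(question):
    """Stand-in for a code-generating model: look up a canned program."""
    return CANNED_PROGRAMS[question]

def recursive_query(image, question):
    """Generate code for a question, run it, and return its result.

    The generated code may itself call recursive_query on sub-questions,
    and each generated function chooses its own return type.
    """
    source = generate_code(question)
    namespace = {"recursive_query": recursive_query}
    exec(source, namespace)  # defines execute() inside namespace
    return namespace["execute"](image)

# Toy "image": a list of detected object labels (hypothetical input format).
image = ["cat", "dog", "cat"]
answer = recursive_query(image, "Are there more cats than dogs?")
```

Here the sub-questions return integers while the top-level question returns a boolean, which mirrors the dynamic type assignment mentioned in the abstract.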

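This work tackles visual question answering through code with a recursive visual programming approach that simplifies the generated code, solves problems more efficiently, and better manages complex data structures; extensive experiments validate the effectiveness of the method.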

Recursive Visual Programming

Visual question answering (VQA) has traditionally been treated as a
single-step task where each question receives the same amount of effort, unlike
natural human question-answering strategies. We explore a question
decomposition strategy for VQA to overcome this limitation. We probe the
ability of recently developed large vision-language models to use human-written
decompositions and produce their own decompositions of visual questions,
finding they are capable of learning both tasks from demonstrations alone.
However, we show that naive application of model-written decompositions can
hurt performance. We introduce a model-driven selective decomposition approach
for second-guessing predictions and correcting errors, and validate its
effectiveness on eight VQA tasks across three domains, showing consistent
improvements in accuracy, including improvements of >20% on medical VQA
datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA
reformulation of the challenging Winoground task.
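
Selective decomposition can be illustrated with a small Python sketch. The abstract does not state how the model decides when to decompose, so the confidence threshold used here is an assumption, and `toy_model`, `toy_decompose`, and `answer_selectively` are hypothetical stand-ins rather than the authors' method or any real model API.

```python
def toy_model(question, context=()):
    """Hypothetical vision-language model stub: returns (answer, confidence)."""
    if question == "Is the person ready to play tennis?":
        # With the sub-answer in context, the stub revises its prediction.
        if "Q: What is the person holding? A: a racket" in context:
            return "yes", 0.9
        return "no", 0.4
    if question == "What is the person holding?":
        return "a racket", 0.95
    return "unknown", 0.0

def toy_decompose(question):
    """Hypothetical decomposer stub: returns a list of sub-questions."""
    return ["What is the person holding?"]

def answer_selectively(question, model, decompose, threshold=0.6):
    """Answer directly; decompose and re-answer only when confidence is low.

    Returns (answer, used_decomposition).
    """
    answer, confidence = model(question)
    if confidence >= threshold:
        return answer, False  # confident enough: keep the first answer
    context = []
    for sub_q in decompose(question):
        sub_a, _ = model(sub_q)
        context.append(f"Q: {sub_q} A: {sub_a}")
    revised, _ = model(question, tuple(context))
    return revised, True

result = answer_selectively(
    "Is the person ready to play tennis?", toy_model, toy_decompose
)
```

Gating decomposition on confidence is one simple way to avoid the performance loss that the abstract attributes to applying model-written decompositions to every question.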

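By studying and applying vision-language models, this paper proposes a question decomposition strategy and a model-driven selective decomposition method to improve accuracy and performance on visual question answering tasks.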