While large multi-modal models (LMMs) have exhibited impressive capabilities
across diverse tasks, their effectiveness in handling complex tasks has been
limited by the prevailing single-step reasoning paradigm. To this end, this
paper proposes VoCoT, a multi-step Visually grounded object-centric
Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is
characterized by two key features: (1) object-centric reasoning paths that
revolve around cross-modal shared object-level information, and (2) visually
grounded representation of object concepts in a multi-modal interleaved and
aligned manner, which effectively bridges the modality gap within LMMs during
long-term generation. Additionally, we construct an instruction dataset to
facilitate LMMs in adapting to reasoning with VoCoT. By incorporating VoCoT into
a prevalent open-source LMM architecture, we build VolCano. With only 7B
parameters and limited input resolution, VolCano demonstrates excellent
performance across various scenarios, surpassing SOTA models, including GPT-4V,
in tasks requiring complex reasoning. Our code, data and model will be
available at this https URL

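This paper proposes VoCoT, a multi-step, visually grounded, object-centric chain-of-thought reasoning framework suited to inference with large multi-modal models (LMMs). Incorporating VoCoT into LMMs effectively bridges the modality gap during long-term generation, and the resulting model performs well across various scenarios, surpassing SOTA models, including GPT-4V, on tasks that require complex reasoning.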