Multimodal large language models (MLLMs) show promise in tasks such as visual question answering (VQA) but still struggle with multimodal reasoning. Recent work adapts agentic frameworks or chain-of-thought (CoT) reasoning to improve performance. However, CoT-based multimodal reasoning often demands costly data annotation and fine-tuning, while agentic approaches that rely on external tools risk propagating unreliable tool outputs. In this paper, we propose Seeing and Reasoning with Confidence (SRICE), a training-free multimodal reasoning framework that integrates external vision models with uncertainty quantification (UQ) into an MLLM to address these challenges. Specifically, SRICE guides the inference process by allowing the MLLM to autonomously select regions of interest through multi-stage interactions with external tools. We use a conformal prediction-based approach to calibrate the outputs of the external tools, and we select the optimal tool by estimating the uncertainty of the MLLM's output. Our experiments show that SRICE improves on the base MLLM by 4.6% on average across five datasets, and on some datasets it even outperforms fine-tuning-based methods, which highlights the importance of reliable tool use in an MLLM agent.
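To make the calibration idea concrete, the following is a minimal sketch of generic split conformal prediction, followed by an uncertainty-based tool choice. It is not SRICE's actual procedure, which the abstract does not specify. The nonconformity score (one minus the probability of the true label), the rule of preferring the tool with the smallest prediction set, and the names `conformal_threshold`, `prediction_set`, `select_tool`, `detector`, and `segmenter` are all assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split conformal threshold from held-out nonconformity scores.

    cal_scores: scores on a calibration set, assumed here to be
                1 - p(true label); higher means a worse fit.
    alpha:      target miscoverage rate (0.25 aims for 75% coverage).
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: accept every label.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, qhat):
    """Labels whose nonconformity score 1 - p is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= qhat]


def select_tool(tool_outputs, qhat):
    """Pick the tool whose output yields the smallest non-empty set.

    Assumption for this sketch: a smaller prediction set means lower
    uncertainty. An empty set is treated as maximally uncertain.
    """
    def set_size(probs):
        size = len(prediction_set(probs, qhat))
        return size if size > 0 else len(probs) + 1

    return min(tool_outputs, key=lambda name: set_size(tool_outputs[name]))


# Hypothetical calibration scores and tool outputs, for illustration only.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
qhat = conformal_threshold(cal_scores, alpha=0.25)

tool_outputs = {
    "detector": {"cat": 0.7, "dog": 0.25, "bird": 0.05},
    "segmenter": {"cat": 0.9, "dog": 0.06, "bird": 0.04},
}
best_tool = select_tool(tool_outputs, qhat)
```

With these illustrative numbers, the threshold is the eighth-smallest calibration score, the `detector` output keeps two candidate labels, and the `segmenter` output keeps one, so the sketch prefers `segmenter`. Under the usual exchangeability assumption, split conformal prediction gives prediction sets that contain the true label with probability at least 1 - alpha, which is what makes set size usable as a calibrated uncertainty signal.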

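This work addresses the challenges that multimodal large language models face in multimodal reasoning, in particular the reliance on costly data annotation and the potential unreliability of external tools. The proposed SRICE framework integrates uncertainty awareness and lets the model autonomously select regions of interest, which improves the reliability and efficiency of the reasoning process. Experiments show that SRICE improves average performance by 4.6% across multiple datasets and, on some datasets, outperforms fine-tuning-based methods.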

Seeing and Reasoning with Confidence: Enhancing Multimodal Large Language Models with an Uncertainty-Aware Agentic Framework