State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated
exceptional capabilities in vision-language tasks. Despite their advanced
functionalities, the performances of LMMs are still limited in challenging
scenarios that require complex reasoning with multiple levels of visual
information. Existing prompting techniques for LMMs focus on either improving
textual reasoning or leveraging tools for image preprocessing, lacking a simple
and general visual prompting scheme to promote vision-language coordination in
LMMs. In this work, we propose Scaffold prompting that scaffolds coordinates to
promote vision-language coordination. Specifically, Scaffold overlays a dot
matrix within the image as visual information anchors and leverages
multi-dimensional coordinates as textual positional references. Extensive
experiments on a wide range of challenging vision-language tasks demonstrate
the superiority of Scaffold over GPT-4V with the textual CoT prompting. Our
code is released in this https URL

提出了一种 Scaffold 提示方案，通过在图像中覆盖点矩阵作为视觉信息锚点，并利用多维坐标作为文本位置参考，以促进视觉 - 语言协调，实验证明其在复杂的视觉 - 语言任务中优于 GPT-4V 与文本的 CoT 提示。