The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, which constrains both their flexibility in use and the depth of their responses. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose SPHINX-V, a new end-to-end trained MLLM that connects a vision encoder, a visual prompt encoder, and an LLM to support various visual prompts (points, bounding boxes, and free-form shapes) alongside language understanding. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data is a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, spanning natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. MDVP-Bench is a comprehensive and challenging benchmark for assessing a model's ability to understand visual prompting instructions. Our experiments demonstrate SPHINX-V's strong multimodal interaction capabilities through visual prompting, with significant improvements in detailed pixel-level description and question answering.

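We introduce the Draw-and-Understand project, which includes a new multi-domain dataset and a challenging benchmark for visual prompting. We propose SPHINX-V, a new end-to-end trained multimodal large language model that connects a vision encoder, a visual prompt encoder, and an LLM to support various visual prompts and language understanding. We also present MDVP-Data and MDVP-Bench, a multi-domain dataset and a challenging benchmark, to advance visual prompting research for multimodal large language models. Our experimental results show that SPHINX-V exhibits strong multimodal interaction capabilities through visual prompting and achieves significant improvements in detailed pixel-level description and question answering.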

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want