LLaVA-Interactive is a research prototype for multimodal human-AI
interaction. The system can have multi-turn dialogues with human users by
taking multimodal user inputs and generating multimodal responses. Importantly,
LLaVA-Interactive goes beyond language prompts: visual prompts are also supported
to align the interaction with human intent. The development of LLaVA-Interactive
is extremely cost-efficient, as the system combines three multimodal skills from
pre-built AI models without additional model training: visual chat from LLaVA,
image segmentation from SEEM, as well as image generation and editing from
GLIGEN. A diverse set of application scenarios is presented to demonstrate the
promise of LLaVA-Interactive and to inspire future research in multimodal
interactive systems.

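LLaVA-Interactive is a research prototype system for multimodal human-AI interaction that can hold multi-turn dialogues with users by taking multimodal user inputs and generating multimodal responses. The system supports visual prompts to align with human intent, and it combines three multimodal skills (visual chat from LLaVA, image segmentation from SEEM, and image generation and editing from GLIGEN), which keeps its development cost extremely low. By presenting a diverse set of application scenarios, the paper demonstrates the potential of LLaVA-Interactive and aims to inspire future research on multimodal interactive systems.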