In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient multi-modal assistant that uses the recently advanced small language model Phi-2 to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advance among compact multi-modal models. It demonstrates that even small language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi.

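This paper introduces LLaVA-Phi, an efficient multi-modal assistant that uses the recently advanced small language model Phi-2 to facilitate multi-modal dialogue. It shows that even a small language model with only 2.7 billion parameters can participate effectively in complex dialogues that integrate textual and visual elements, provided it is trained on a high-quality corpus. The model performs well on publicly available benchmarks for visual comprehension, reasoning, and knowledge-based perception. Beyond its strong performance on multi-modal dialogue tasks, the model opens new directions for applications in time-sensitive environments and in systems that require real-time interaction, such as embodied agents, highlighting the potential of small language models to reach sophisticated levels of understanding and interaction while remaining more resource-efficient.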

LLaVA-$\phi$: Efficient Multi-Modal Assistant with Small Language Model