Multi-modal learning has significantly advanced generative AI, especially in vision-language modeling. Innovations like GPT-4V and open-source projects such as LLaVA have enabled robust conversational agents capable of zero-shot task completions. However, applying these technologies in the biomedical field presents unique challenges. Recent initiatives like LLaVA-Med have started to adapt instruction-tuning for biomedical contexts using large datasets such as PMC-15M. Our research offers three key contributions: (i) we present a new instruct dataset enriched with medical image-text pairs from Claude3-Opus and LLaMA3 70B, (ii) we propose a novel image encoding strategy using hierarchical representations to improve fine-grained biomedical visual comprehension, and (iii) we develop the Llama3-Med model, which achieves state-of-the-art zero-shot performance on biomedical visual question answering benchmarks, with an average performance improvement of over 10% compared to previous methods. These advancements provide more accurate and reliable tools for medical professionals, bridging gaps in current multi-modal conversational assistants and promoting further innovations in medical AI.

我们的研究在生物医学领域提出了一个新的指导数据集，利用医学图像文本对，提出了一种新的图像编码策略，通过使用分层表示改善了精细的生物医学视觉理解，并且开发了LLama3-Med模型，在生物医学视觉问答基准测试中实现了最先进的零-shot性能，相比于以前的方法，平均性能提高超过10％，这些进展为医疗专业人员提供了更准确可靠的工具，弥补了当前多模态对话助手中的差距，并促进了医疗人工智能的进一步创新。

推进生物医学中高分辨率视觉语言模型