BriefGPT.xyz
Aug, 2023
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
Chi Chen, Ruoyu Qin, Fuwen Luo, Xiaoyue Mi, Peng Li...
TL;DR
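This paper proposes a multimodal large language model (MLLM) with enhanced visual instruction tuning: it integrates a region-level vision encoder to achieve finer-grained cross-modal alignment. The authors design several data generation strategies to construct an image-region-language instruction dataset, and experimental results demonstrate the model's superior performance.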
Abstract
Recently, multimodal large language models (MLLMs) that enable Large Language Models (LLMs) to interpret images through visual instruction tuning have achieved significant success. However, existing …