BriefGPT.xyz
Mar, 2024
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun...
TL;DR
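LLaVA-HR, a new multimodal large language model approach based on image resolution, effectively improves visual recognition by combining low-resolution and high-resolution image features, and it outperforms existing models on 11 vision-language tasks.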
Abstract
Despite remarkable progress, existing multimodal large language models (MLLMs) are still inferior in granular visual recognition. Contrary to previous works, we study this problem from the perspective of ...