Large Multimodal Models (LMMs) have demonstrated impressive capabilities on
general vision-language tasks. However, because of their limited input
resolution (e.g., 448 x 448) and the incomplete descriptions in their training
image-text pairs, these models often struggle with intricate scene
understanding and narratives.
Here we address the problem by proposing Monkey. Our contributions are
two-fold: 1) without pretraining from scratch, our method builds on
an existing vision encoder (e.g., vit-BigHuge) to raise the supported input
resolution up to 896 x 1344 pixels; 2) we propose a multi-level
description generation method that automatically provides rich information
to guide the model in learning contextual associations between scenes and
objects. Our extensive testing across more than 16 distinct datasets reveals
that Monkey achieves consistently competitive performance compared with existing
LMMs on fundamental tasks such as Image Captioning, General Visual Question
Answering (VQA), and Document-oriented VQA. Models, an interactive demo, and the
source code are provided at
this https URL
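
The resolution figures above are related by simple arithmetic: 896 = 2 x 448 and 1344 = 3 x 448, so an 896 x 1344 input can be covered by six non-overlapping 448 x 448 windows. The sketch below only illustrates that arithmetic. The abstract does not state how the larger input is passed to the encoder, so the tiling scheme, the function name `tile_boxes`, and the reading of 896 x 1344 as height x width are all assumptions made for illustration.

```python
from typing import List, Tuple

# Crop box as (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def tile_boxes(height: int, width: int, tile: int = 448) -> List[Box]:
    """Return non-overlapping tile x tile crop boxes covering a height x width image.

    Hypothetical helper: it assumes the image size is an exact multiple of
    the tile size, as 896 x 1344 is for a 448-pixel tile.
    """
    if height % tile or width % tile:
        raise ValueError("image size must be a multiple of the tile size")
    return [
        (left, top, left + tile, top + tile)
        for top in range(0, height, tile)    # rows, top to bottom
        for left in range(0, width, tile)    # columns, left to right
    ]


boxes = tile_boxes(896, 1344)
print(len(boxes))   # 6
print(boxes[0])     # (0, 0, 448, 448)
print(boxes[-1])    # (896, 448, 1344, 896)
```

Each box uses the (left, top, right, bottom) order, so it could be passed to an image-cropping routine that follows the same convention before each crop is sent to a 448 x 448 encoder.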

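This study proposes Monkey, a multimodal model that raises the input resolution and uses a multi-level description generation method to provide rich information that helps the model learn contextual associations between scenes and objects. In extensive testing, Monkey shows competitive performance on fundamental tasks such as image captioning, general visual question answering, and document-oriented visual question answering.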