Multi-modal information retrieval (MMIR) is a rapidly evolving field, where
significant progress, particularly in image-text pairing, has been made through
advanced representation learning and cross-modality alignment research.
However, current benchmarks for evaluating MMIR performance on image-text
pairing leave a notable gap in the scientific domain: chart and table images
described in scholarly language rarely play a significant role.
To bridge this gap, we develop a specialised scientific MMIR (SciMMIR)
benchmark by leveraging open-access paper collections to extract data relevant
to the scientific domain. This benchmark comprises 530K meticulously curated
image-text pairs, extracted from figures and tables with detailed captions in
scientific documents. We further annotate the image-text pairs with two-level
subset-subcategory hierarchy annotations to facilitate a more comprehensive
evaluation of the baselines. We conduct zero-shot and fine-tuning evaluations
on prominent multi-modal image-captioning and visual language models, such as
CLIP and BLIP. Our analysis offers critical insights for MMIR in the scientific
domain, including the impact of pre-training and fine-tuning settings and the
influence of the visual and textual encoders. All our data and checkpoints are
publicly available at this https URL
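
As a rough illustration of how image-to-text retrieval of this kind can be scored, the sketch below ranks candidate captions for each image by cosine similarity and reports mean reciprocal rank (MRR) and Hits@k. The choice of metrics, the function names (`cosine`, `rank_of_gold`, `evaluate`), and the toy embeddings are assumptions made for illustration; the abstract does not specify the benchmark's metrics or evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_gold(query, candidates, gold_index):
    """1-based rank of the gold candidate, sorted by descending similarity."""
    gold_score = cosine(query, candidates[gold_index])
    # Every candidate that scores strictly higher pushes the gold one down.
    return 1 + sum(1 for c in candidates if cosine(query, c) > gold_score)


def evaluate(image_embs, text_embs, k=1):
    """Return (MRR, Hits@k), assuming image i is paired with caption i."""
    ranks = [rank_of_gold(img, text_embs, i) for i, img in enumerate(image_embs)]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits_at_k = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits_at_k


# Toy embeddings (hypothetical); a real run would use encoder outputs,
# for example from a CLIP- or BLIP-style model.
toy_images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_texts = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]
toy_mrr, toy_hits = evaluate(toy_images, toy_texts, k=1)
print(f"MRR={toy_mrr:.3f} Hits@1={toy_hits:.3f}")
```

In the toy data, the second image's gold caption is outranked by one other caption, so both metrics fall below 1. The same function could be applied per subset or subcategory to obtain a breakdown along a two-level hierarchy.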

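Advanced representation learning and cross-modality alignment research have driven significant progress in image-text matching. To address the shortcomings of current evaluations of image-text matching performance in the scientific domain, we develop a specialised scientific multi-modal information retrieval (SciMMIR) benchmark that uses open-access paper collections to extract data relevant to the scientific domain. It comprises image-text pairs built from figures and tables with detailed captions in scientific documents, annotated with a two-level subset-subcategory hierarchy to support a more comprehensive evaluation of baseline models. We run zero-shot and fine-tuning evaluations on prominent multi-modal image-captioning and visual language models such as CLIP and BLIP. The analysis offers key insights for multi-modal information retrieval in the scientific domain, including the impact of pre-training and fine-tuning settings and the influence of the visual and textual encoders. All of our data and checkpoints are publicly available at this URL.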

SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such
as directly generating websites from handwritten text and identifying humorous
elements within images. These features are rarely observed in previous
vision-language models. We believe the primary reason for GPT-4's advanced
multi-modal generation capabilities is its use of a more advanced large
language model (LLM). To examine this phenomenon, we present MiniGPT-4,
which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one
projection layer. Our findings reveal that MiniGPT-4 possesses many
capabilities similar to those exhibited by GPT-4, such as detailed image
description generation and website creation from handwritten drafts.
Furthermore, we also observe other emerging capabilities in MiniGPT-4,
including writing stories and poems inspired by given images, providing
solutions to problems shown in images, teaching users how to cook based on food
photos, etc. In our experiments, we found that pretraining on raw image-text
pairs alone can produce unnatural language outputs that lack coherence, with
repetition and fragmented sentences. To address this
problem, we curate a high-quality, well-aligned dataset in the second stage to
finetune our model using a conversational template. This step proved crucial
for augmenting the model's generation reliability and overall usability.
Notably, our model is highly computationally efficient, as we only train a
projection layer utilizing approximately 5 million aligned image-text pairs.
Our code, pre-trained model, and collected dataset are available at
this https URL
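
The idea of bridging a frozen visual encoder and a frozen LLM with a single projection layer can be sketched in a few lines. This is a minimal, dependency-free illustration and not MiniGPT-4's actual code: the class and function names, the dimensions, the random initialisation, and the choice to place the projected visual tokens before the text embeddings are all assumptions.

```python
import random


class LinearProjection:
    """The only trainable component in this sketch: y = W x + b."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Small random weights; both dimensions are hypothetical.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

    def num_parameters(self):
        return sum(len(row) for row in self.weight) + len(self.bias)


def build_llm_inputs(visual_tokens, text_embeddings, projection):
    """Map frozen-encoder features into the LLM embedding space.

    The projected visual tokens are placed before the text embeddings;
    that ordering is an assumption of this sketch.
    """
    projected = [projection(v) for v in visual_tokens]
    return projected + text_embeddings


# Hypothetical sizes: 3 visual tokens of width 4, LLM embedding width 6.
projection = LinearProjection(in_dim=4, out_dim=6)
visual_tokens = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]  # frozen encoder output
text_embeddings = [[0.0] * 6, [1.0] * 6]                  # frozen LLM embeddings
sequence = build_llm_inputs(visual_tokens, text_embeddings, projection)
print(len(sequence), "tokens;", projection.num_parameters(), "trainable parameters")
```

Because the encoder and the LLM stay frozen, only the projection's weights and bias would receive gradient updates, which is what keeps the trainable parameter count small.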

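This paper introduces MiniGPT-4, a model that aligns a visual encoder with an advanced large language model (LLM), following the example of GPT-4. The model shows multiple capabilities, such as generating detailed image descriptions and creating websites from handwritten drafts, and training on an aligned image-text dataset improves generation reliability and overall usability.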