Recent advancements in large vision-language models (LVLMs), such as GPT-4V
and LLaVA, have been substantial. LLaVA's modular architecture, in particular,
offers a blend of simplicity and efficiency. Recent work mainly focuses on
adding more pre-training and instruction-tuning data to improve model
performance. This paper delves into the often-neglected aspects of data
efficiency during pre-training and the selection process for instruction tuning
datasets. Our research indicates that merely increasing the size of
pre-training data does not guarantee improved performance and may, in fact,
lead to its degradation. Furthermore, we have established a pipeline to
pinpoint the most efficient instruction tuning (SFT) dataset, implying that not
all SFT data utilized in existing studies are necessary. The primary objective
of this paper is not to introduce a state-of-the-art model, but rather to serve
as a roadmap for future research, aiming to optimize data usage during
pre-training and fine-tuning processes to enhance the performance of
vision-language models.

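Summary: This paper studies the often-neglected aspects of data efficiency in large vision-language models (LVLMs) and the selection of pre-training and fine-tuning data, with the aim of optimizing data usage to improve the performance of vision-language models.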

Rethinking Overlooked Aspects in Vision-Language Models

We evaluate the zero-shot ability of GPT-4 and LLaVA to perform simple Visual
Network Analysis (VNA) tasks on small-scale graphs. We evaluate the Vision
Language Models (VLMs) on five tasks related to three foundational network science
concepts: identifying nodes of maximal degree on a rendered graph, identifying
whether signed triads are balanced or unbalanced, and counting components. The
tasks are structured to be easy for a human who understands the underlying
graph theoretic concepts, and can all be solved by counting the appropriate
elements in graphs. We find that while GPT-4 consistently outperforms LLaVA,
both models struggle with every visual network analysis task we propose. We
publicly release the first benchmark for the evaluation of VLMs on foundational
VNA tasks.
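
The three concepts named in this abstract can each be checked by simple counting. The following is a minimal sketch of how ground-truth answers for such tasks might be computed; it is not the benchmark's own code. The function names, the edge-list representation, and the encoding of edge signs as +1/-1 are assumptions made for illustration.

```python
from collections import defaultdict


def max_degree_nodes(edges):
    """Return the sorted list of nodes that share the highest degree."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    if not degree:
        return []
    top = max(degree.values())
    return sorted(node for node, d in degree.items() if d == top)


def is_balanced_triad(signs):
    """A signed triad is balanced when the product of its three edge signs
    is positive, i.e. it has an even number of negative edges."""
    a, b, c = signs
    return a * b * c > 0


def count_components(nodes, edges):
    """Count connected components with an iterative depth-first search."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components


# Hypothetical toy graph: a star around A, one separate edge, one isolated node.
toy_nodes = list("ABCDEFG")
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("E", "F")]
```

On the toy graph, node A has degree 3 and the graph has three components (the star, the edge E-F, and the isolated node G), which is the kind of answer a model would have to read off the rendered image.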

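Summary: The paper evaluates the zero-shot ability of GPT-4 and LLaVA to perform simple visual network analysis tasks on small-scale graphs and finds that, although GPT-4 consistently outperforms LLaVA, both models struggle with every proposed task. The authors also publicly release the first benchmark for evaluating VLMs on visual network analysis tasks.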

Multimodal LLMs Struggle with Basic Visual Network Analysis: a VNA Benchmark

Large multimodal models (LMMs) have recently shown encouraging progress with
visual instruction tuning. In this note, we show that the fully-connected
vision-language cross-modal connector in LLaVA is surprisingly powerful and
data-efficient. With simple modifications to LLaVA, namely, using
CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA
data with simple response formatting prompts, we establish stronger baselines
that achieve state-of-the-art results across 11 benchmarks. Our final 13B
checkpoint uses merely 1.2M publicly available samples, and finishes full
training in ~1 day
on a single 8-A100 node. We hope this can make state-of-the-art LMM research
more accessible. Code and model will be publicly available.
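
To make the "MLP projection" connector concrete, here is a minimal pure-Python sketch of a fully-connected projector that maps vision-encoder patch features into the language model's embedding space. The abstract only says "MLP projection"; the two-layer shape with a GELU in between, the toy dimensions, and all function names below are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def gelu(x):
    """Exact GELU activation using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(in_dim, out_dim, rng):
    """Create (weight, bias) for a linear layer; weight has out_dim rows."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_project(patch_features, layer1, layer2):
    """Project each patch feature independently: Linear -> GELU -> Linear."""
    projected = []
    for feature in patch_features:
        hidden = [gelu(h) for h in linear(feature, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


# Toy sizes (assumed); real encoders and language models are far wider.
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4
rng = random.Random(0)
layer1 = init_linear(VISION_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
           for _ in range(NUM_PATCHES)]
visual_tokens = mlp_project(patches, layer1, layer2)
```

Each patch yields one vector of the language model's width, so the projected patches can be placed alongside text token embeddings in the input sequence.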

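Summary: With simple modifications to LLaVA, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data, the authors establish stronger baselines that achieve state-of-the-art results on 11 benchmarks.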

Improved Baselines with Visual Instruction Tuning

In this work, we present SciGraphQA, a synthetic multi-turn question-answer
dataset related to academic graphs. SciGraphQA is 13 times larger than
ChartVQA, the previously largest chart-visual question-answering dataset. It is
also the largest open-sourced chart VQA dataset with non-synthetic charts. To
build our dataset, we selected 290,000 Computer Science or Machine Learning
ArXiv papers published between 2010 and 2020, and then used Palm-2 to generate
295K samples of open-vocabulary multi-turn question-answering dialogues about
the graphs. As context, we provided the text-only Palm-2 with paper title,
abstract, paragraph mentioning the graph, and rich text contextual data from
the graph itself, obtaining dialogues with an average of 2.23 question-answer
turns for each graph. We asked GPT-4 to assess the matching quality of our
question-answer turns given the paper's context, obtaining an average rating of
8.7/10 on our 3K test set. We evaluated the 0-shot capability of the most
popular MLLMs, such as LLaVA, mPLUG-Owl, BLIP-2, and OpenFlamingo, on our
dataset, finding LLaVA-13B to be the most performant with a CIDEr score of
0.08. We further enriched the question prompts for LLaVA by including the
serialized data tables extracted from the graphs using the DePlot model,
boosting LLaVA's 0-shot CIDEr to 0.15. To verify the validity of our dataset,
we also fine-tuned LLaVA using our dataset, reaching a substantially higher
CIDEr score of 0.26. We anticipate further accuracy improvement by including
segmentation mask tokens and leveraging larger LLM backbones coupled with
emergent prompting techniques. Our code and data are open-sourced.
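
The prompt-enrichment step described in this abstract (adding a serialized data table to the question) could look roughly like the sketch below. The abstract does not give the serialization format or prompt wording, so the pipe-separated layout, the prompt template, and the function names are hypothetical; the table is assumed to have been extracted already, and no DePlot call is shown.

```python
def serialize_table(header, rows):
    """Flatten a table into pipe-separated text, one line per row."""
    lines = [" | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)


def build_enriched_prompt(question, header, rows):
    """Prepend the serialized table to the question (hypothetical template)."""
    table_text = serialize_table(header, rows)
    return (
        "Data table extracted from the graph:\n"
        f"{table_text}\n\n"
        f"Question: {question}"
    )


# Hypothetical extracted table and question, for illustration only.
example_header = ["epoch", "accuracy"]
example_rows = [[1, 0.62], [2, 0.71], [3, 0.74]]
example_prompt = build_enriched_prompt(
    "At which epoch does accuracy first exceed 0.7?",
    example_header,
    example_rows,
)
```

The idea is that the model receives the numeric values as text in addition to the image, so it does not have to read every value off the plot.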

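Summary: This paper introduces SciGraphQA, a synthetic multi-turn question-answer dataset about academic graphs and the largest open-sourced chart VQA dataset with non-synthetic charts to date. Palm-2 was used to generate 295K open-vocabulary multi-turn question-answering dialogues from Computer Science and Machine Learning ArXiv papers, and GPT-4 was used to assess the matching quality of the question-answer turns. Adding serialized data tables extracted from the graphs with the DePlot model raised LLaVA's 0-shot CIDEr to 0.15, and fine-tuning LLaVA on the dataset reached a CIDEr of 0.26.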