Text-centric visual question answering (VQA) has made great strides with the
development of Multimodal Large Language Models (MLLMs), yet open-source models
still fall short of leading models like GPT4V and Gemini, partly due to a lack
of extensive, high-quality instruction tuning data. To this end, we introduce a
new approach for creating a massive, high-quality instruction-tuning dataset,
Square-10M, which is generated using closed-source MLLMs. The data construction
process, termed Square, consists of four steps: Self-Questioning, Answering,
Reasoning, and Evaluation. Our experiments with Square-10M led to three key
findings: 1) Our model, TextSquare, considerably surpasses previous open-source
state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%).
It even outperforms top-tier models like GPT4V and Gemini in 6 of 10
text-centric benchmarks. 2) Additionally, we demonstrate the critical role of
VQA reasoning data in offering comprehensive contextual insights for specific
questions. This not only improves accuracy but also significantly mitigates
hallucinations. Specifically, TextSquare scores an average of 75.1% across four
general VQA and hallucination evaluation datasets, outperforming previous
state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals
a clear pattern: model performance improves in direct proportion to exponential
increases in instruction tuning data volume, which validates the necessity of
the dataset scale and
the high quality of Square-10M.
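
The four Square steps named above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the authors' implementation: the `MLLM` callable, the `VQASample` record, the prompt wording, and the yes/no filtering rule are all hypothetical, and what each step does is inferred from its name.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any callable taking (image, prompt) and returning text.
MLLM = Callable[[str, str], str]


@dataclass
class VQASample:
    image: str
    question: str
    answer: str
    reasoning: str


def square(image: str, mllm: MLLM) -> List[VQASample]:
    """Run the four Square steps on one image (illustrative prompts only)."""
    # 1) Self-Questioning: the model proposes questions about the image text.
    raw = mllm(image, "Propose questions about the text in this image, one per line.")
    questions = [q.strip() for q in raw.splitlines() if q.strip()]

    samples: List[VQASample] = []
    for q in questions:
        # 2) Answering: the model answers its own question.
        answer = mllm(image, f"Answer the question: {q}")
        # 3) Reasoning: the model explains the context behind the answer.
        reasoning = mllm(image, f"Explain why '{answer}' answers: {q}")
        # 4) Evaluation: keep the sample only if the model judges it correct
        #    (assumed filtering rule).
        verdict = mllm(image, f"Is '{answer}' a correct answer to '{q}'? Reply yes or no.")
        if verdict.strip().lower().startswith("yes"):
            samples.append(VQASample(image, q, answer, reasoning))
    return samples
```

Passing the model in as a plain callable keeps the sketch independent of any particular closed-source API; a real pipeline would also need batching, retries, and stricter output parsing.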

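By using the Square-10M dataset, TextSquare far surpasses open-source models, proposes a new approach to tuning text-centric MLLMs, and sets a new standard on OCR evaluation (62.2%), while outperforming GPT4V and Gemini on 6 text-centric benchmarks. The study also demonstrates the key role of VQA reasoning data in providing comprehensive contextual insight, which improves accuracy and significantly mitigates hallucinations. Finally, it reveals the relationship between exponential growth in the scale of text-centric VQA datasets and improvement in model performance, validating the necessity of the dataset scale and the high quality of Square-10M.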

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

With contributions from the open-source community, a vast amount of
instruction tuning (IT) data has emerged. Given the significant resource
allocation required by training and evaluating models, it is advantageous to
have an efficient method for selecting high-quality IT data. However, existing
methods for instruction data selection have limitations such as relying on
fragile external APIs, being affected by biases in GPT models, or reducing the
diversity of the selected instruction dataset. In this paper, we propose an
industrial-friendly, expert-aligned and diversity-preserved instruction data
selection method: Clustering and Ranking (CaR). CaR consists of two steps. The
first step involves ranking instruction pairs using a scoring model that is
well aligned with expert preferences (achieving an accuracy of 84.25%). The
second step involves preserving dataset diversity through a clustering
process. In our experiment, CaR selected a subset containing only 1.96% of
Alpaca's IT data, yet the underlying AlpaCaR model trained on this subset
outperforms Alpaca by an average of 32.1% in GPT-4 evaluations. Furthermore,
our method utilizes small models (355M parameters) and requires only 11.2% of
the monetary cost of existing methods, making it easily deployable in
industrial scenarios.
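
The two CaR steps can be sketched as follows. This is a minimal illustration, not the paper's code: the quality scores and cluster labels are assumed to come from an external scoring model and clustering step, and the function name `car_select`, the parameters `n_top` and `n_per_cluster`, and the rule for combining the two steps (top-scoring pairs overall plus the best pairs in each cluster) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def car_select(
    scores: Sequence[float],
    clusters: Sequence[int],
    n_top: int,
    n_per_cluster: int,
) -> List[int]:
    """Return sorted indices of the selected instruction pairs.

    scores[i]   -- quality score of pair i (from a hypothetical scoring model)
    clusters[i] -- cluster label of pair i (from a hypothetical clustering step)
    """
    if len(scores) != len(clusters):
        raise ValueError("scores and clusters must have the same length")

    # Step 1 (ranking): order all pairs by quality score, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:n_top])

    # Step 2 (clustering): also keep the best n_per_cluster pairs of every
    # cluster so that low-scoring topics are not dropped entirely.
    taken: Dict[int, int] = defaultdict(int)
    for i in ranked:
        label = clusters[i]
        if taken[label] < n_per_cluster:
            selected.add(i)
            taken[label] += 1

    return sorted(selected)
```

For example, with scores `[0.9, 0.8, 0.1, 0.2, 0.7]`, cluster labels `[0, 0, 1, 1, 0]`, `n_top=2` and `n_per_cluster=1`, the ranking step keeps pairs 0 and 1, and the clustering step adds pair 3 as the best member of cluster 1.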

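The Clustering and Ranking (CaR) method selects high-quality instruction data according to expert preferences and preserves dataset diversity during selection. Experiments show that CaR uses a subset containing only 1.96% of Alpaca's IT data, yet the AlpaCaR model trained on it outperforms Alpaca by an average of 32.1% in GPT-4 evaluations. The method also needs only small models with 355M parameters and only 11.2% of the monetary cost of existing methods, making it suitable for industrial scenarios.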

Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

We present Virtual Prompt Injection (VPI) for instruction-tuned Large
Language Models (LLMs). VPI allows an attacker-specified virtual prompt to
steer the model behavior under a specific trigger scenario without any explicit
injection in model input. For instance, if an LLM is compromised with the
virtual prompt "Describe Joe Biden negatively." for Joe Biden-related
instructions, then any service deploying this model will propagate biased views
when handling user queries related to Joe Biden. VPI is especially harmful for
two primary reasons. Firstly, the attacker can take fine-grained control over
LLM behaviors by defining various virtual prompts, exploiting LLMs' proficiency
in following instructions. Secondly, this control is achieved without any
interaction from the attacker while the model is in service, leading to
a persistent attack. To demonstrate the threat, we propose a simple method for
performing VPI by poisoning the model's instruction tuning data. We find that
our proposed method is highly effective in steering the LLM with VPI. For
example, by injecting only 52 poisoned examples (0.1% of the training data
size) into the instruction tuning data, the percentage of negative responses
given by the trained model on Joe Biden-related queries changes from 0% to 40%.
We thus highlight the necessity of ensuring the integrity of the
instruction-tuning data, as even a small amount of poisoned data can cause stealthy and
persistent harm to the deployed model. We further explore the possible defenses
and identify data filtering as an effective way to defend against the poisoning
attacks. Our project page is available at this https URL
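
The data-filtering defense mentioned above can be sketched generically. This is a hedged illustration, not the authors' filter: `filter_training_data`, `poisoning_rate`, the `quality_score` callable, and the threshold are hypothetical. The 52,000 figure used in the usage note is only what the abstract's numbers imply (52 examples being 0.1% of the training data).

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Dict[str, str]  # assumed shape: {"instruction": ..., "response": ...}


def filter_training_data(
    examples: Sequence[Example],
    quality_score: Callable[[Example], float],
    threshold: float,
) -> Tuple[List[Example], List[Example]]:
    """Split examples into (kept, dropped) using a caller-supplied scorer.

    quality_score is a hypothetical function rating how well a response
    matches its instruction; low-scoring pairs are treated as suspicious.
    """
    kept: List[Example] = []
    dropped: List[Example] = []
    for example in examples:
        if quality_score(example) >= threshold:
            kept.append(example)
        else:
            dropped.append(example)
    return kept, dropped


def poisoning_rate(n_poisoned: int, n_total: int) -> float:
    """Fraction of the training set that is poisoned."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_poisoned / n_total
```

With the abstract's numbers, `poisoning_rate(52, 52_000)` gives 0.001, i.e. 0.1%. A filter is useful only if its scorer flags such a small fraction of examples without discarding much clean data.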

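We present Virtual Prompt Injection (VPI) for instruction-tuned large language models (LLMs). VPI allows an attacker-specified virtual prompt to steer model behavior under a specific trigger scenario without any explicit injection into the model input. We demonstrate the risk of VPI by poisoning the model's instruction tuning data and recommend data filtering as an effective defense.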