Instruction tuning large language models (LLMs) using machine-generated
instruction-following data has improved zero-shot capabilities on new tasks,
but the idea is less explored in the multimodal field. In this paper, we
present the first attempt to use language-only GPT-4 to generate multimodal
language-image instruction-following data. By instruction tuning on such
generated data, we introduce LLaVA: Large Language and Vision Assistant, an
end-to-end trained large multimodal model that connects a vision encoder and
LLM for general-purpose visual and language understanding. Our early experiments
show that LLaVA demonstrates impressive multimodal chat abilities, sometimes
exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and
yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal
instruction-following dataset. When fine-tuned on Science QA, the synergy of
LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make
the GPT-4-generated visual instruction tuning data, our model, and code base
publicly available.

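This paper uses the language-only model GPT-4 to generate multimodal image-text instruction-following data for tuning a multimodal model. The resulting model, LLaVA, performs well on multiple datasets.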
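
To make the phrase "connects a vision encoder and LLM" concrete, here is a minimal sketch. It assumes the connector is a single linear projection that maps each image-patch feature into the LLM's embedding space. The abstract does not specify the connector, so treat this as one plausible reading. The names `project_patches` and `build_llm_input`, the toy dimensions, and the choice to place visual tokens before the text tokens are all illustrative assumptions.

```python
def project_patches(patch_features, weight):
    """Map vision features into the LLM embedding space (assumed linear connector).

    patch_features: list of num_patches vectors, each of length vision_dim
    weight:         vision_dim x llm_dim matrix as nested lists (hypothetical)
    """
    return [
        [sum(f * w for f, w in zip(patch, column)) for column in zip(*weight)]
        for patch in patch_features
    ]


def build_llm_input(patch_features, text_embeddings, weight):
    """Concatenate projected visual tokens with text token embeddings.

    Putting the visual tokens first is an illustrative choice only.
    """
    visual_tokens = project_patches(patch_features, weight)
    return visual_tokens + text_embeddings


# Toy example: 2 image patches with vision_dim=3, projected to llm_dim=2.
patches = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5]]  # one text token embedding, already in the LLM space

sequence = build_llm_input(patches, text, W)
```

In this sketch the LLM would consume `sequence` like any other list of token embeddings, which is what lets instruction tuning operate end to end on mixed image and text input.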

Visual Instruction Tuning

We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike
standard contrastive learning with softmax normalization, the sigmoid loss
operates solely on image-text pairs and does not require a global view of the
pairwise similarities for normalization. The sigmoid loss simultaneously allows
further scaling up the batch size, while also performing better at smaller
batch sizes. With only four TPUv4 chips, we can train a Base CLIP model at 4k
batch size and a Large LiT model at 20k batch size; the latter achieves 84.5%
ImageNet zero-shot accuracy in two days. This disentanglement of the batch size
from the loss further allows us to study the impact of examples vs. pairs and
of the negative-to-positive ratio. Finally, we push the batch size to the extreme, up
to one million, and find that the benefits of growing batch size quickly
diminish, with a more reasonable batch size of 32k being sufficient. We hope
our research motivates further explorations in improving the quality and
efficiency of language-image pre-training.

This paper proposes a simple sigmoid-based loss for image-text pre-training, which allows the batch size to be scaled up and delivers better performance.
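
The pairwise sigmoid loss described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation. Every image-text pair in the batch is treated as an independent binary classification: label +1 for matching pairs, -1 for all others. The function names, the default `temperature=10.0` and `bias=-10.0`, and the choice to average over the batch size are assumptions made for the example.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sigmoid loss over all image-text pairs in a batch.

    Pair (i, j) is a positive when i == j and a negative otherwise.
    Each term depends only on its own pair, so no softmax normalization
    over the full similarity matrix is needed.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / len(imgs)


# Toy batch: aligned pairs should score a lower loss than swapped ones.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = pairwise_sigmoid_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = pairwise_sigmoid_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Because each term is computed per pair, the double loop can be split into chunks without any global reduction, which is the property the abstract credits for decoupling the batch size from the loss.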