We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale

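This work addresses the gap between descriptive synthetic captions and factual web-scale alt-text by introducing KALE, a new dataset of 218 million image-text pairs. A two-stage approach that combines synthetic dense image captions with web-scale alt-text produces factually grounded image captions, and experiments show that KALE can significantly improve the capability and knowledge of multimodal models.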

BLIP3-KALE: Knowledge-Augmented Large-Scale Dense Captions