Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths of up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and a corresponding output ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B-parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models such as GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
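
The segment-level pairing behind IterDPO can be illustrated with a minimal sketch. It is not the authors' implementation: the names `PreferencePair`, `split_into_segments`, and `build_pairs` are hypothetical, images are omitted, and conditioning each pair on the previously corrected segments is an assumption the abstract does not spell out. What the abstract does state is that a corrected segment is preferred over the original one.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # instruction plus the corrected segments accepted so far (assumed)
    chosen: str    # corrected version of the current segment
    rejected: str  # original model-written version of the current segment


def split_into_segments(text: str, max_words: int) -> list[str]:
    """Greedily group paragraphs into segments of at most max_words words.

    A single paragraph longer than max_words becomes its own segment.
    """
    segments, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        if current and count + n > max_words:
            segments.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        segments.append("\n\n".join(current))
    return segments


def build_pairs(instruction: str,
                original_segments: list[str],
                corrected_segments: list[str]) -> list[PreferencePair]:
    """Form one preference pair per segment that was actually corrected."""
    if len(original_segments) != len(corrected_segments):
        raise ValueError("segment lists must have the same length")
    pairs = []
    prefix = instruction
    for orig, fixed in zip(original_segments, corrected_segments):
        if fixed != orig:
            pairs.append(PreferencePair(prompt=prefix, chosen=fixed, rejected=orig))
        # Later segments are conditioned on the corrected text (assumption).
        prefix = prefix + "\n\n" + fixed
    return pairs
```

Under these assumptions, one long output with several corrected segments yields several short preference pairs, which is why segment-level feedback can be cheaper to collect than a judgment over the full 3,000-word output.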

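This work addresses the difficulty that existing large vision-language models have in generating coherent outputs beyond 1,000 words, which stems mainly from a lack of long-output examples. By introducing LongWriter-V-22k, an SFT dataset of 22,158 examples, together with Direct Preference Optimization (DPO), the study shows how to achieve long outputs while maintaining high fidelity. Our 7B-parameter model performs strongly on the newly developed MMLongBench-Write benchmark, surpassing larger proprietary models such as GPT-4o.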

LongWriter-V: Enabling Ultra-Long, High-Fidelity Generation in Vision-Language Models