This paper makes the first attempt towards unsupervised preference alignment
in Vision-Language Models (VLMs). We generate chosen and rejected responses
from the original and augmented images of each pair, respectively, and conduct preference
alignment with direct preference optimization. The approach rests on one core idea:
a properly designed augmentation of the image input will induce a VLM to generate
false but hard negative responses, which the model can learn from to
produce more robust and powerful answers. The whole pipeline no longer hinges
on supervision from GPT-4 or human involvement during alignment, and is highly
efficient, requiring only a few lines of code. With only 8k randomly sampled unsupervised
samples, it achieves a 90\% relative score with respect to GPT-4 on complex reasoning in
LLaVA-Bench, and improves LLaVA-7B/13B by 6.7\%/5.6\% in score on the complex
multi-modal benchmark MM-Vet. Visualizations show its improved ability to
align with user intentions. A series of ablations is conducted to
reveal the underlying mechanism of the approach and to indicate its potential
for further scaling. Code will be available.
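
The pipeline described above can be illustrated with a minimal sketch. It is not the authors' released code: the function names `build_preference_pair` and `dpo_loss`, the `generate` and `augment` callables, and the default `beta=0.1` are all assumptions made for illustration. The loss is the standard direct preference optimization objective, written for a single pair of sequence-level log-probabilities.

```python
import math
from typing import Any, Callable, Dict


def build_preference_pair(
    image: Any,
    prompt: str,
    generate: Callable[[Any, str], str],  # hypothetical: the VLM's answer function
    augment: Callable[[Any], Any],        # hypothetical: the image augmentation
) -> Dict[str, Any]:
    """Build one unsupervised preference pair from a single image and prompt."""
    chosen = generate(image, prompt)             # answer on the original image
    rejected = generate(augment(image), prompt)  # answer on the augmented image
    return {"image": image, "prompt": prompt,
            "chosen": chosen, "rejected": rejected}


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the abstract does not state it
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch no external judge appears anywhere: the same model produces both responses, and only the image differs. The loss decreases as the policy assigns relatively more probability to the chosen response than the frozen reference model does.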

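This study makes the first attempt at unsupervised preference alignment in vision-language models (VLMs), generating chosen and rejected responses for original and augmented image pairs and applying direct preference optimization. A properly designed augmentation of the image input induces the VLM to generate false but hard negative responses, which helps the model learn from them and produce stronger and more robust answers. The whole pipeline no longer relies on GPT-4 supervision or human involvement during alignment, and is efficient with concise code. Using only 8k randomly sampled unsupervised samples, it reaches a 90\% relative score with respect to GPT-4 on complex reasoning in LLaVA-Bench, and improves the scores of LLaVA-7B/13B by 6.7\%/5.6\% on the complex multi-modal benchmark MM-Vet. Visualizations show an improved ability to align with user intentions. The authors conduct a series of ablations to reveal the underlying mechanism of the method and to show its potential for further scaling. Code will be made available.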