Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent. However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models, a phenomenon known as reward overoptimization. To investigate this issue in depth, we introduce the Text-Image Alignment Assessment (TIA2) benchmark, which comprises a diverse collection of text prompts, images, and human annotations. Our evaluation of several state-of-the-art reward models on this benchmark reveals their frequent misalignment with human assessment. We empirically demonstrate that overoptimization occurs notably when a poorly aligned reward model is used as the fine-tuning objective. To address this, we propose TextNorm, a simple method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts. We demonstrate that incorporating the confidence-calibrated rewards in fine-tuning effectively reduces overoptimization, resulting in twice as many wins in human evaluation for text-image alignment compared against the baseline reward models.
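The idea behind TextNorm can be sketched in a few lines. The abstract does not give the formula, so the softmax normalization below, the `temperature` parameter, and the helper names `textnorm_reward` and `toy_reward` are all assumptions made for illustration, not the authors' implementation. The sketch scores an image against the original prompt and a set of semantically contrastive prompts, then returns the share of the softmax mass that falls on the original prompt.

```python
import math


def textnorm_reward(raw_reward, image, prompt, contrastive_prompts, temperature=1.0):
    """Normalize a raw reward over a prompt and its contrastive alternatives.

    Assumption: confidence is modeled as a softmax over raw rewards; the
    abstract only says confidence is estimated across contrastive prompts.
    """
    candidates = [prompt] + list(contrastive_prompts)
    scores = [raw_reward(image, p) / temperature for p in candidates]
    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    # Fraction of the softmax mass assigned to the original prompt.
    return exps[0] / sum(exps)


def toy_reward(image_tags, prompt):
    """Hypothetical stand-in for a learned reward model: counts prompt
    words that appear in a set of tags describing the image."""
    return float(sum(word in image_tags for word in prompt.lower().split()))


image_tags = {"three", "dogs"}
contrastive = ["two dogs", "three cats", "five dogs"]

# The reward model separates the true prompt from the alternatives.
confident = textnorm_reward(toy_reward, image_tags, "three dogs", contrastive)

# A reward model that cannot tell the prompts apart gets the uniform value 1/4.
flat = textnorm_reward(lambda image, prompt: 5.0, image_tags, "three dogs", contrastive)
```

Under this assumption, a reward model that scores every contrastive prompt equally yields a low normalized reward however high its raw score is, so raw scores that do not distinguish the prompt from its alternatives are discounted during fine-tuning.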

Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models