Preference modeling techniques, such as direct preference optimization (DPO),
have proven effective in enhancing the generalization abilities of large language
models (LLMs). However, in tasks involving video instruction-following, providing
informative feedback, especially for detecting hallucinations in generated
responses, remains a significant challenge. Previous studies have explored
using large multimodal models (LMMs) as reward models to guide preference
modeling, but their ability to accurately assess the factuality of generated
responses compared to corresponding videos has not been conclusively
established. This paper introduces a novel framework that utilizes detailed
video captions as a proxy for video content, enabling language models to
incorporate this information as supporting evidence for scoring video Question
Answering (QA) predictions. Our approach demonstrates robust alignment with
the OpenAI GPT-4V model's reward mechanism, which directly takes video frames as
input. Furthermore, we show that applying this tailored reward through DPO
significantly improves the performance of video LMMs on video QA tasks.
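
As a rough illustration of the final step, the sketch below selects a preference pair from reward scores and computes the standard DPO loss for that pair. The pair-selection rule (highest score versus lowest score), the function names, and the default beta value are assumptions made for illustration; they are not details taken from the paper.

```python
import math


def pick_pair(scored):
    """Return (chosen, rejected) from a list of (response, reward) tuples.

    Assumption: the highest-scored response is preferred and the
    lowest-scored one is dispreferred.
    """
    ranked = sorted(scored, key=lambda item: item[1])
    return ranked[-1][0], ranked[0][0]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed token log-probability of a response
    under the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls as the policy shifts probability toward the chosen response relative to the reference.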

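This paper introduces a novel framework that uses detailed video captions as a proxy for video content, enabling language models to incorporate this information as supporting evidence when scoring video question answering (QA) predictions, and shows that the approach aligns robustly with the reward mechanism of the OpenAI GPT-4V model, which takes video frames directly as input. We further show that applying this tailored reward through direct preference optimization significantly improves the performance of video language models on video QA tasks.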

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

In recent years, Large Language Models (LLMs) have gained immense attention
due to their notable emergent capabilities, surpassing those seen in earlier
language models. A particularly intriguing application of LLMs is their role as
evaluators for texts produced by various generative models.
In this study, we delve into the potential of LLMs as reliable assessors of
factual consistency in summaries generated by text-generation models.
First, we introduce a new approach to factuality assessment in which a single
LLM carries out the entire question-answering-based factuality scoring
process. Following this, we examine
the efficacy of various LLMs in direct factuality scoring, benchmarking them
against traditional measures and human annotations.
Contrary to initial expectations, our results indicate a lack of significant
correlations between factuality metrics and human evaluations, specifically for
GPT-4 and PaLM-2. Notable correlations were only observed with GPT-3.5 across
two factuality subcategories. These consistent findings across various factual
error categories suggest a fundamental limitation in the current LLMs'
capability to accurately gauge factuality.
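
To make the comparison between metric scores and human annotations concrete, the sketch below computes a Spearman rank correlation in plain Python. The abstract does not say which correlation coefficient was used, so the choice of Spearman, the function names, and the input format are assumptions for illustration only.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

A coefficient near zero between an LLM's factuality scores and the human labels would correspond to the lack of correlation the abstract reports; a significance test would still be needed to support any such claim.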

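This study examines whether large language models can serve as reliable evaluators of the factual consistency of summaries produced by text-generation models, and finds limitations in their factuality scoring.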