Large Vision-Language Models (LVLMs) excel at integrating visual and
linguistic contexts to produce detailed content, facilitating applications such
as image captioning. However, using LVLMs to generate descriptions often faces
the challenge of object hallucination (OH), where the output text misrepresents
actual objects in the input image. While previous studies attribute the
occurrence of OH to the inclusion of more details, our study finds technical
flaws in existing metrics that make both model evaluations and conclusions
about OH unreliable. This has sparked a debate on the question: Do more
details always introduce more hallucinations in LVLM-based image captioning?
In this paper, we address this debate by proposing a novel decoding strategy,
Differentiated Beam Decoding (DBD), along with a reliable new set of evaluation
metrics: CLIP-Precision, CLIP-Recall, and CLIP-F1. DBD decodes the wealth of
information hidden in visual input into distinct language representations
called unit facts in parallel. This decoding is achieved via a well-designed
differential score that guides the parallel search and candidate screening. The
selected unit facts are then aggregated to generate the final caption. Our
proposed metrics evaluate the comprehensiveness and accuracy of image captions
by comparing the embedding groups of ground-truth image regions and generated
text partitions. Extensive experiments on the Visual Genome dataset validate
the effectiveness of our approach, demonstrating that it produces detailed
descriptions while maintaining low hallucination levels.
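
The idea of scoring a caption by comparing a group of image-region embeddings
with a group of text-partition embeddings can be illustrated with a small
sketch. The abstract does not give the exact formulas, so everything below is
an assumption: embeddings are plain lists of floats (in practice they would
come from a CLIP image encoder and text encoder), matching uses the maximum
cosine similarity, and the function name `clip_style_prf` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clip_style_prf(region_embs, text_embs):
    """Soft precision, recall, and F1 between two embedding groups.

    Assumption (not taken from the paper): each item is matched to its
    most similar counterpart in the other group.
    """
    if not region_embs or not text_embs:
        return 0.0, 0.0, 0.0
    # Precision: is each text partition supported by some image region?
    precision = sum(
        max(cosine(t, r) for r in region_embs) for t in text_embs
    ) / len(text_embs)
    # Recall: is each image region covered by some text partition?
    recall = sum(
        max(cosine(r, t) for t in text_embs) for r in region_embs
    ) / len(region_embs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy example: two regions, one text partition that matches only the first.
p, r, f1 = clip_style_prf([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

In this toy case the single text partition is fully supported (precision 1.0)
but covers only one of the two regions (recall 0.5), which is the kind of
trade-off between accuracy and comprehensiveness that the metrics are meant to
expose.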

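We propose a novel decoding strategy, Differentiated Beam Decoding (DBD), along with a reliable set of evaluation metrics for image captioning: CLIP-Precision, CLIP-Recall, and CLIP-F1. Extensive experiments on the Visual Genome dataset validate the effectiveness of our approach, which generates detail-rich descriptions while keeping hallucination levels low.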

Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?

Research on Large Vision-Language Models (LVLMs) has advanced significantly
thanks to the success of Large Language Models (LLMs). Nevertheless, these
Vision-Language Models (VLMs) suffer from hallucination: because of an
insufficient understanding of the vision and language modalities, VLMs may
generate incorrect perceptual information in downstream applications, for
example, captioning a non-existent entity.
To address the hallucination phenomenon, on the one hand, we introduce a
Contrastive Instruction Evaluation Method (CIEM), which is an automatic
pipeline that leverages an annotated image-text dataset coupled with an LLM to
generate factual/contrastive question-answer pairs for the evaluation of the
hallucination of VLMs. On the other hand, based on CIEM, we further propose a
new instruction tuning method, Contrastive Instruction Tuning (CIT), to
alleviate hallucination in VLMs by automatically
producing high-quality factual/contrastive question-answer pairs and
corresponding justifications for model tuning. Through extensive experiments on
CIEM and CIT, we pinpoint the hallucination issues commonly present in existing
VLMs, the inability of current instruction-tuning datasets to handle the
hallucination phenomenon, and the superiority of CIT-tuned VLMs on both CIEM
and public datasets.
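
To make the notion of factual/contrastive question-answer pairs concrete, here
is a minimal sketch. It is not the CIEM pipeline: the abstract describes an LLM
generating the pairs from an annotated image-text dataset, whereas this
stand-in uses a fixed question template. The function name `build_qa_pairs`,
the template wording, and the dictionary layout are all hypothetical.

```python
def build_qa_pairs(present_objects, candidate_objects):
    """Build yes/no question-answer pairs from object annotations.

    Assumption (not from the paper): a template replaces the LLM.
    Factual pairs ask about annotated objects (answer "yes");
    contrastive pairs ask about candidates absent from the image
    (answer "no").
    """
    present = set(present_objects)
    pairs = []
    for obj in sorted(present):
        pairs.append({
            "question": f"Is there a {obj} in the image?",
            "answer": "yes",
            "type": "factual",
        })
    for obj in sorted(set(candidate_objects) - present):
        pairs.append({
            "question": f"Is there a {obj} in the image?",
            "answer": "no",
            "type": "contrastive",
        })
    return pairs


# Toy example: the image is annotated with a dog; "cat" is a distractor.
qa = build_qa_pairs(["dog"], ["dog", "cat"])
```

A model that answers "yes" to the contrastive question about the cat would be
hallucinating an entity that the annotations do not contain.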

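By studying Large Vision-Language Models (LVLMs), this work addresses the hallucination problem in which existing Vision-Language Models (VLMs) generate incorrect perceptual information in downstream applications. It uses the Contrastive Instruction Evaluation Method (CIEM) and Contrastive Instruction Tuning (CIT) to produce high-quality question-answer pairs and corresponding justifications, improving model performance.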