Large Vision-Language Models (LVLMs) are increasingly adept at generating
contextually detailed and coherent responses from visual inputs. However, their
application in multimodal decision-making and open-ended generation is hindered
by a notable rate of hallucinations, where generated text inaccurately
represents the visual contents. To address this issue, this paper introduces
the Instruction Contrastive Decoding (ICD) method, a novel approach designed to
reduce hallucinations during LVLM inference. Our method is inspired by our
observation that what we call disturbance instructions significantly exacerbate
hallucinations in multimodal fusion modules. ICD contrasts the output
distributions produced under standard and disturbance instructions, thereby
increasing alignment uncertainty and effectively subtracting hallucinated
concepts from the original
distribution. Through comprehensive experiments on discriminative benchmarks
(POPE and MME) and a generative benchmark (LLaVa-Bench), we demonstrate that
ICD significantly mitigates both object-level and attribute-level
hallucinations. Moreover, our method not only addresses hallucinations but also
significantly enhances the general perception and recognition capabilities of
LVLMs.
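
The contrast step can be illustrated with a toy sketch. The abstract does not state the exact scoring rule, so this assumes the form commonly used in contrastive decoding, (1 + alpha) * log p_standard - alpha * log p_disturbed. The function names, the alpha value, the three-word vocabulary, and the logits are all hypothetical and are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrast_logits(standard_logits, disturbed_logits, alpha=1.0):
    """Assumed contrastive score: (1 + alpha) * log p_std - alpha * log p_dist.

    Tokens that become more likely under the disturbance instruction
    are penalised; alpha = 0 recovers ordinary decoding.
    """
    std = log_softmax(standard_logits)
    dis = log_softmax(disturbed_logits)
    return [(1.0 + alpha) * s - alpha * d for s, d in zip(std, dis)]

def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical next-token logits for an image that shows a dog.
vocab = ["dog", "cat", "frisbee"]
standard = [1.9, 2.0, 0.5]   # standard instruction: "cat" narrowly wins
disturbed = [1.0, 3.0, 0.5]  # disturbance instruction: "cat" is amplified

plain_choice = vocab[greedy_pick(standard)]
icd_choice = vocab[greedy_pick(contrast_logits(standard, disturbed))]
print(plain_choice, icd_choice)
```

In this toy case the token that the disturbance amplifies loses its narrow lead once the disturbed distribution is subtracted, which is the intuition the abstract describes.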

Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding

Recently, Target-oriented Multimodal Sentiment Classification (TMSC) has
gained significant attention among scholars. However, current multimodal models
have reached a performance bottleneck. To investigate the causes of this
problem, we perform extensive empirical evaluation and in-depth analysis of the
datasets to answer the following questions: Q1: Are the modalities equally
important for TMSC? Q2: Which multimodal fusion modules are more effective? Q3:
Do existing datasets adequately support the research? Our experiments and
analyses reveal that current TMSC systems rely primarily on the textual
modality, as the sentiment of most targets can be determined from text alone.
Consequently, we point out several directions for future work on TMSC in
terms of model design and dataset construction. The code and data can be found
at this https URL

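This work examines the current performance bottleneck in target-oriented multimodal sentiment classification. Through empirical evaluation and in-depth analysis of the datasets, it shows that current systems rely mainly on the textual modality, and it proposes several directions for model design and dataset construction.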