Large Vision-Language Models (LVLMs) have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucinations, which pose significant risks in critical domains such as healthcare and autonomous systems. Despite previous mitigation efforts, a persistent issue remains: visual defects caused by vision-language misalignment, which create a bottleneck in visual processing capacity. To address this challenge, we develop Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs (CATCH), based on Information Bottleneck theory. CATCH introduces Complementary Visual Decoupling (CVD) for visual information separation, Non-Visual Screening (NVS) for hallucination detection, and Adaptive Token-level Contrastive Decoding (ATCD) for hallucination mitigation. Together, these components target the visual defects that diminish fine-grained feature perception and cause hallucinations to accumulate in open-ended scenarios. CATCH applies to various visual question-answering tasks without requiring any specific data or prior knowledge, and it generalizes robustly to new tasks without additional training, opening new possibilities for advancing LVLMs in challenging applications.
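
To make the idea of token-level contrastive decoding concrete, the following is a minimal sketch of one generic contrastive decoding step. It is not the CATCH formulation, which the abstract does not specify. It assumes two sets of next-token logits are already available: one from the model conditioned on the full visual input and one from a contrast branch (for example, a degraded or decoupled visual input). The function name `contrastive_decode_step` and the parameters `alpha` (contrast strength) and `beta` (plausibility cutoff) are hypothetical and chosen for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_decode_step(logits_full, logits_contrast, alpha=1.0, beta=0.1):
    """Pick the next token with a generic contrastive rule (illustrative only).

    score(token) = (1 + alpha) * log p_full(token) - alpha * log p_contrast(token)

    Only tokens whose full-input probability is at least beta times the
    highest full-input probability are considered, so that the contrast
    cannot promote tokens the full model finds implausible.
    Assumes 0 < beta <= 1 and alpha >= 0 (hypothetical parameters).
    """
    lp_full = log_softmax(logits_full)
    lp_contrast = log_softmax(logits_contrast)
    cutoff = math.log(beta) + max(lp_full)

    best_token, best_score = None, -math.inf
    for token, (lf, lc) in enumerate(zip(lp_full, lp_contrast)):
        if lf < cutoff:
            continue  # implausible under the full-input distribution
        score = (1 + alpha) * lf - alpha * lc
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Toy vocabulary of three tokens. The contrast branch strongly prefers
# token 0 over token 1, so the contrast shifts the choice toward token 1.
full = [2.0, 1.9, -3.0]
contrast = [2.0, 0.0, 3.0]
chosen = contrastive_decode_step(full, contrast, alpha=1.0, beta=0.1)
greedy = contrastive_decode_step(full, contrast, alpha=0.0, beta=0.1)
```

With `alpha=0.0` the rule reduces to greedy decoding on the full-input distribution. An adaptive variant would vary the contrast strength per token rather than fixing it, but how CATCH does so is not described in this abstract.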

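This work addresses the hallucination problem in Large Vision-Language Models (LVLMs), which poses serious risks in critical domains such as healthcare and autonomous systems. It proposes a novel method, Complementary Adaptive Token-level Contrastive Decoding (CATCH), which combines visual information separation, hallucination detection, and token-level contrastive decoding to significantly reduce visual defects and hallucinations and to improve model performance on visual question-answering tasks. The method requires no specific data or training and has broad application potential.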

Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in Large Vision-Language Models