Large vision-language models (LVLMs) suffer heavily from hallucination,
occasionally generating responses that clearly contradict the image content.
The key problem lies in their weak ability to comprehend detailed
content in a multi-modal context, which can be attributed mainly to two factors:
the training data and the loss function. The vision instruction dataset primarily
focuses on global description, and the auto-regressive loss function favors
text modeling rather than image understanding. In this paper, we bring more
detailed vision annotations and more discriminative vision models to facilitate
the training of LVLMs, so that they can generate more precise responses without
hallucinating. On the one hand, we generate image-text pairs with
detailed relationship annotations from the Panoptic Scene Graph (PSG) dataset. These
conversations pay more attention to detailed facts in the image, encouraging
the model to answer questions based on multi-modal contexts. On the other hand,
we integrate SAM and a mask prediction loss as auxiliary supervision, forcing the
LVLMs to identify context-related objects, so that they
can generate more accurate responses and mitigate hallucination. Moreover, to
provide a deeper evaluation of hallucination in LVLMs, we propose a new
benchmark, RAH-Bench. It divides vision hallucination into three
types, which contradict the image with wrong categories, attributes, or
relations, and introduces the False Positive Rate as a detailed sub-metric for each
type. On this benchmark, our approach demonstrates a +8.4% improvement
over the original LLaVA and achieves widespread performance improvements
across other models.
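
The per-type False Positive Rate mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes the standard definition (false positives divided by all ground-truth negatives) and assumes the benchmark poses yes/no questions whose negative items are each labeled with one hallucination type. The function name and the toy records are hypothetical and are not taken from RAH-Bench.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Compute a per-type false positive rate.

    `records` is an iterable of (hallucination_type, predicted_yes) pairs,
    one per question whose ground-truth answer is "no" (assumed setup).
    A false positive is a negative question the model answers "yes" to.
    """
    negatives = defaultdict(int)
    false_positives = defaultdict(int)
    for h_type, predicted_yes in records:
        negatives[h_type] += 1
        if predicted_yes:
            false_positives[h_type] += 1
    return {t: false_positives[t] / negatives[t] for t in negatives}


# Illustrative toy predictions only, not real benchmark results.
records = [
    ("category", True), ("category", False),
    ("attribute", False), ("attribute", False),
    ("relation", True), ("relation", True),
    ("relation", False), ("relation", False),
]
rates = false_positive_rates(records)
for h_type, rate in rates.items():
    print(f"{h_type}: FPR = {rate:.2f}")
```

Reporting the rate separately per type shows which kind of misleading question (category, attribute, or relation) a model is most prone to accept.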

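By introducing more detailed vision annotations and more discriminative vision models into the training of large vision-language models, the approach enables them to generate more precise responses with less hallucination. In addition, a new evaluation benchmark, RAH-Bench, is proposed, which divides hallucination into three types; compared with the original LLaVA, the method achieves a +8.4% improvement on this benchmark and brings widespread performance gains on other models.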

Mitigating the Hallucination Problem in Vision-Language Models with Visual Supervision

Mitigating Hallucination in Visual Language Models with Visual Supervision

Fake news detection aims to detect fake news widely spreading on social media
platforms, which can negatively influence the public and the government. Many
approaches have been developed to exploit relevant information from news
images, text, or videos. However, these methods may suffer from the following
limitations: (1) they ignore the inherent emotional information of the news, which
could be beneficial since it conveys the subjective intentions of the authors;
(2) they pay little attention to the relation (similarity) between the title and
the textual content of news articles, which often use irrelevant titles to
attract readers' attention. To this end, we propose a novel Title-Text
similarity and emotion-aware Fake news detection (TieFake) method by jointly
modeling the multi-modal context information and the author sentiment in a
unified framework. Specifically, we employ BERT and ResNeSt to learn the
representations of text and images, respectively, and utilize a publisher emotion
extractor to capture the author's subjective emotion in the news content. We
also propose a scaled dot-product attention mechanism to capture the similarity
between title features and textual features. Experiments are conducted on two
publicly available multi-modal datasets, and the results demonstrate that our
proposed method can significantly improve the performance of fake news
detection. Our code is available at this https URL
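
As a rough illustration of the title-text similarity step, the sketch below implements standard scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, in plain Python. Using title features as queries and body-text features as keys and values is an assumption made for this example, as are the function names and the toy vectors; the abstract does not specify these details, and this is not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V.

    Each argument is a list of equal-length float vectors.
    """
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(dim_v)]
        )
    return outputs, weights


# Hypothetical toy features: 2 title tokens and 3 body-text tokens.
title = [[1.0, 0.0], [0.0, 1.0]]
body = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = scaled_dot_product_attention(title, body, body)
```

Each row of `weights` indicates how strongly one title token attends to each body token, so a title that is unrelated to the article body would yield flat, uninformative attention rows.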

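The paper proposes TieFake, which uses BERT and ResNeSt to learn representations of text and images, employs a publisher emotion extractor to capture the author's subjective emotion in the news content, and introduces a scaled dot-product attention mechanism to capture the similarity between title features and textual features. The method targets fake news detection on social media, and experiments on two datasets demonstrate its effectiveness.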

TieFake: Title-Text Similarity and Emotion-Aware Fake News Detection

TieFake: Title-Text Similarity and Emotion-Aware Fake News Detection

Referring image segmentation is a fundamental vision-language task that aims
to segment out an object referred to by a natural language expression from an
image. One of the key challenges behind this task is leveraging the referring
expression for highlighting relevant positions in the image. A paradigm for
tackling this problem is to leverage a powerful vision-language ("cross-modal")
decoder to fuse features independently extracted from a vision encoder and a
language encoder. Recent methods have made remarkable advancements in this
paradigm by exploiting Transformers as cross-modal decoders, concurrent with the
Transformer's overwhelming success in many other vision-language tasks.
Adopting a different approach in this work, we show that significantly better
cross-modal alignments can be achieved through the early fusion of linguistic
and visual features in intermediate layers of a vision Transformer encoder
network. By conducting cross-modal feature fusion in the visual feature
encoding stage, we can leverage the well-proven correlation modeling power of a
Transformer encoder for excavating helpful multi-modal context. This way,
accurate segmentation results are readily harvested with a lightweight mask
predictor. Without bells and whistles, our method surpasses the previous
state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.

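This work proposes a new method that fuses linguistic and visual features in the intermediate layers of a vision Transformer encoder network to achieve better cross-modal alignment, and then obtains accurate segmentation results with a lightweight mask predictor. The method surpasses previous state-of-the-art methods on the RefCOCO, RefCOCO+, and G-Ref datasets.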