The growth of social media, characterized by its multimodal nature, has led
to the emergence of diverse phenomena and challenges, which calls for an
effective, unified approach to automated social media tasks. Powerful Large
Vision Language Models make it possible to handle a variety of tasks
simultaneously, but even with carefully designed prompting methods,
general-domain models often fall short in aligning with the unique speaking
style and context of social media tasks. In this paper, we introduce a Large Vision
Language Model for Social Media Processing (SoMeLVLM), which is a cognitive
framework equipped with five key capabilities including knowledge &
comprehension, application, analysis, evaluation, and creation. SoMeLVLM is
designed to understand and generate realistic social media behavior. We have
developed a 654k multimodal social media instruction-tuning dataset to support
our cognitive framework and fine-tune our model. Our experiments demonstrate
that SoMeLVLM achieves state-of-the-art performance in multiple social media
tasks. Further analysis shows its significant advantages over baselines in
terms of cognitive abilities.

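This paper introduces SoMeLVLM, a Large Vision Language Model for social media processing with five key capabilities (knowledge & comprehension, application, analysis, evaluation, and creation), which achieves state-of-the-art performance on multiple social media tasks.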

SoMeLVLM: A Large Vision Language Model for Social Media Processing

The rise of multimodal misinformation on social platforms poses significant
challenges for individuals and societies. Its increased credibility and broader
impact compared to textual misinformation make detection complex, requiring
robust reasoning across diverse media types and profound knowledge for accurate
verification. The emergence of Large Vision Language Models (LVLMs) offers a
potential solution to this problem. Leveraging their proficiency in processing
visual and textual information, LVLMs demonstrate promising capabilities in
recognizing complex information and exhibit strong reasoning skills. In this
paper, we first investigate the potential of LVLMs for multimodal misinformation
detection. We find that although LVLMs outperform LLMs, their reasoning
has limited power when supporting evidence is lacking.
Based on these observations, we propose LEMMA: LVLM-Enhanced Multimodal
Misinformation Detection with External Knowledge Augmentation. LEMMA leverages
LVLM intuition and reasoning capabilities while augmenting them with external
knowledge to enhance the accuracy of misinformation detection. Our method
improves the accuracy over the top baseline LVLM by 7% and 13% on the Twitter
and Fakeddit datasets, respectively.
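
As a rough illustration of what augmenting a model's judgment with external knowledge can look like, here is a minimal Python sketch. It is not the LEMMA implementation: the abstract does not specify the pipeline, so the two-pass structure, the `needs_evidence` flag, and the names `detect_with_evidence`, `toy_judge`, and `toy_retrieve` are all assumptions made for this example, and the toy functions stand in for a real LVLM and a real retrieval backend.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: a judge returns (label, needs_evidence),
# a retriever returns a list of evidence snippets for a text query.
Judge = Callable[[str, str, List[str]], Tuple[str, bool]]
Retriever = Callable[[str], List[str]]


def detect_with_evidence(text: str, image_ref: str,
                         judge: Judge, retrieve: Retriever) -> str:
    """Classify a post, consulting external evidence only when requested."""
    # First pass: the model judges the post from its own reasoning alone.
    label, needs_evidence = judge(text, image_ref, [])
    if not needs_evidence:
        return label
    # Second pass: retrieve external evidence and judge again with it.
    evidence = retrieve(text)
    label, _ = judge(text, image_ref, evidence)
    return label


def toy_judge(text: str, image_ref: str, evidence: List[str]) -> Tuple[str, bool]:
    # Stand-in for an LVLM: undecided without evidence, decisive with it.
    if not evidence:
        return "unverified", True
    refuted = any("debunked" in snippet for snippet in evidence)
    return ("misinformation" if refuted else "real"), False


def toy_retrieve(query: str) -> List[str]:
    # Stand-in for a search or fact-checking backend.
    return ["Fact-check: this claim was debunked."]


result = detect_with_evidence("Shark swims on a flooded highway",
                              "shark.jpg", toy_judge, toy_retrieve)
print(result)
```

Passing the judge and retriever as callables keeps the control flow separate from any particular model or search service.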

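By combining Large Vision Language Models (LVLMs) with external knowledge augmentation, the LEMMA method substantially improves the accuracy of multimodal misinformation detection.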

LEMMA: Towards LVLM-Enhanced Multimodal Misinformation Detection with External Knowledge Augmentation

Finetuning a large vision language model (VLM) on a target dataset after
large scale pretraining is a dominant paradigm in visual question answering
(VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in
non-natural-image domains are orders of magnitude smaller than those for
general-purpose VQA. While collecting additional labels for specialized tasks
or domains can be challenging, unlabeled images are often available. We
introduce SelTDA (Self-Taught Data Augmentation), a strategy for finetuning
large VLMs on small-scale VQA datasets. SelTDA uses the VLM and target dataset
to build a teacher model that can generate question-answer pseudolabels
directly conditioned on an image alone, allowing us to pseudolabel unlabeled
images. SelTDA then finetunes the initial VLM on the original dataset augmented
with freshly pseudolabeled images. We describe a series of experiments showing
that our self-taught data augmentation increases robustness to adversarially
searched questions, counterfactual examples and rephrasings, improves domain
generalization, and results in greater retention of numerical reasoning skills.
The proposed strategy requires no additional annotations or architectural
modifications, and is compatible with any modern encoder-decoder multimodal
transformer. Code available at this https URL
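
The three steps described above (build a teacher from the VLM and target dataset, pseudolabel unlabeled images, finetune on the augmented data) can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' code: the function name `seltda`, the tuple-based data layout, and the toy teacher and finetune stubs are hypothetical placeholders for real model training.

```python
from typing import Callable, List, Tuple

Image = str                      # assumption: an image is referenced by path
Example = Tuple[Image, str, str]  # (image, question, answer)
Teacher = Callable[[Image], List[Tuple[str, str]]]


def seltda(labeled: List[Example],
           unlabeled_images: List[Image],
           train_teacher: Callable[[List[Example]], Teacher],
           finetune: Callable[[List[Example]], object]) -> object:
    """Self-taught data augmentation, expressed over pluggable callables."""
    # Step 1: build a teacher that generates (question, answer) pairs
    # conditioned on an image alone, using the small target dataset.
    teacher = train_teacher(labeled)
    # Step 2: pseudolabel every unlabeled image with the teacher.
    pseudo = [(img, q, a) for img in unlabeled_images for q, a in teacher(img)]
    # Step 3: finetune the initial model on original plus pseudolabeled data.
    return finetune(labeled + pseudo)


def toy_train_teacher(labeled: List[Example]) -> Teacher:
    # Stand-in for finetuning the VLM as a question-answer generator.
    def teacher(image: Image) -> List[Tuple[str, str]]:
        return [(f"What is shown in {image}?", "unknown")]
    return teacher


def toy_finetune(dataset: List[Example]) -> List[Example]:
    # Stand-in for finetuning: simply returns the data it would train on.
    return list(dataset)


augmented = seltda([("img0.png", "What color?", "red")],
                   ["img1.png", "img2.png"],
                   toy_train_teacher, toy_finetune)
print(len(augmented))
```

Because the sketch only changes the training data, it reflects the abstract's point that no additional annotations or architectural changes are needed.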

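This paper introduces a self-taught data augmentation strategy for finetuning large vision language models on small-scale visual question answering datasets, which increases robustness to adversarially searched questions, counterfactual examples, and rephrasings, improves domain generalization, and retains more numerical reasoning skills.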