Image ad understanding is a crucial task with wide real-world applications. Although the task is highly challenging, involving diverse atypical scenes, real-world entities, and reasoning over scene text, how to interpret image ads remains relatively under-explored, especially in the era of foundational vision-language models (VLMs) that feature impressive generalizability and adaptability. In this paper, we perform the first empirical study of image ad understanding through the lens of pre-trained VLMs. We benchmark these VLMs and reveal practical challenges in adapting them to image ad understanding. We propose a simple feature adaptation strategy to effectively fuse multimodal information for image ads and further empower it with knowledge of real-world entities. We hope our study draws more attention to image ad understanding, which is broadly relevant to the advertising industry.

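This paper presents the first empirical study of image ad understanding using pre-trained vision-language models (VLMs). Along the way, we identify practical challenges in adapting these VLMs to image ad understanding and propose a simple feature adaptation strategy that effectively fuses the multimodal information in image ads and further strengthens it with knowledge of real-world entities. We hope our study draws more attention to image ad understanding, which is broadly relevant to the advertising industry.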

KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models