News image captioning is a variant of image captioning that requires a model
to generate a more informative caption from a news image and its associated
news article. Multimodal large language models (MLLMs) have developed rapidly
in recent years and are promising for news image captioning. However,
according to our experiments, common MLLMs are not good at generating entities
in the zero-shot setting. Their ability to handle entity information remains
limited even after simple fine-tuning on a news image captioning dataset. To
obtain a more powerful model for handling multimodal entity information, we
design two multimodal entity-aware alignment tasks and an alignment framework
to align the model and generate news image captions. Our method achieves
better CIDEr scores than previous state-of-the-art models on the GoodNews
dataset (72.33 -> 86.29) and on the NYTimes800k dataset (70.83 -> 85.61).

Entity-Aware Multimodal Alignment Framework for News Image Captioning

The goal of News Image Captioning is to generate an image caption according
to the content of both a news article and an image. To leverage the visual
information effectively, it is important to exploit the connection between the
context in the articles/captions and the images. Psychological studies indicate
that human faces in images receive higher attention priority. Moreover,
humans often play a central role in news stories, as also evidenced by the
face-name co-occurrence pattern we discover in existing News Image Captioning
datasets. Therefore, we design a face-naming module for faces in images and
names in captions/articles to learn a better name embedding. Apart from names,
which can be directly linked to an image area (faces), news image captions
mostly contain context information that can only be found in the article.
Humans typically address this by searching for relevant information from the
article based on the image. To emulate this thought process, we design a
retrieval strategy using CLIP to retrieve sentences that are semantically close
to the image. We conduct extensive experiments to demonstrate the efficacy of
our framework. Without using additional paired data, we establish new
state-of-the-art performance on two News Image Captioning datasets, exceeding
the previous state-of-the-art by 5 CIDEr points. We will release code upon
acceptance.
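
The retrieval step described above can be illustrated with a minimal sketch.
It assumes the image and every article sentence have already been embedded
into the same vector space (for example by a CLIP image encoder and text
encoder) and only shows the ranking by cosine similarity. The function names,
the toy two-dimensional embeddings, and the cutoff k are illustrative
assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_sentences(
    image_emb: Sequence[float],
    sentence_embs: List[Sequence[float]],
    sentences: List[str],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k sentences whose embeddings are closest to the image embedding."""
    scored = [(cosine(image_emb, emb), s) for emb, s in zip(sentence_embs, sentences)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [(s, score) for score, s in scored[:k]]


# Toy data: hypothetical embeddings standing in for real CLIP outputs.
image_emb = [1.0, 0.0]
sentences = ["The mayor spoke at the rally.", "Stocks fell on Monday.", "Crowds gathered downtown."]
sentence_embs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]

top = retrieve_sentences(image_emb, sentence_embs, sentences, k=2)
```

In this toy setup the first and third sentences are ranked above the second,
whose embedding is orthogonal to the image embedding. With a real encoder,
only the embedding step would change; the ranking logic would stay the same.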

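This work automates news image caption generation by designing a face-naming module and a retrieval strategy to make better use of visual information, and it surpasses the previous best performance.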

Visually-Aware Context Modeling for News Image Captioning

News Image Captioning requires describing an image by leveraging additional
context from a news article. Previous works only coarsely leverage the article
to extract the necessary context, which makes it challenging for models to
identify relevant events and named entities. In our paper, we first demonstrate
that by combining more fine-grained context that captures the key named
entities (obtained via an oracle) and the global context that summarizes the
news, we can dramatically improve the model's ability to generate accurate news
captions. This raises the question: how can such key entities be extracted
automatically from an image? We propose to use the pre-trained vision and language
retrieval model CLIP to localize the visually grounded entities in the news
article and then capture the non-visual entities via an open relation
extraction model. Our experiments demonstrate that by simply selecting a better
context from the article, we can significantly improve the performance of
existing models and achieve new state-of-the-art performance on multiple
benchmarks.
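
One way to picture the second stage of this context selection is the sketch
below. It assumes that a set of visually grounded entities has already been
found (for instance from the sentences CLIP ranks closest to the image) and
that relation triples have already been produced by some open relation
extraction tool. The abstract does not say how the two are combined, so the
linking rule used here (keep any entity that shares a triple with a grounded
entity) is purely an assumption, as are the function name and the example
entities.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object); hypothetical format


def expand_entities(grounded: Set[str], triples: List[Triple]) -> Set[str]:
    """Collect entities that share a triple with a grounded entity.

    Entities that are already grounded are not returned again.
    """
    linked: Set[str] = set()
    for subj, _relation, obj in triples:
        if subj in grounded and obj not in grounded:
            linked.add(obj)
        elif obj in grounded and subj not in grounded:
            linked.add(subj)
    return linked


# Hypothetical inputs for illustration only.
grounded = {"Jane Doe"}
triples = [
    ("Jane Doe", "spoke at", "City Hall"),
    ("Acme Corp", "hired", "Jane Doe"),
    ("John Roe", "visited", "Paris"),
]

non_visual = expand_entities(grounded, triples)
```

Here "City Hall" and "Acme Corp" are picked up because each appears in a
triple with the grounded entity, while the unrelated third triple contributes
nothing.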

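This paper proposes using the pre-trained vision and language retrieval model CLIP to localize visually grounded entities in the news article and an open relation extraction model to capture non-visual entities, which significantly improves the performance of existing models and achieves new state-of-the-art results on multiple benchmarks.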