News image captioning is a variant of image captioning that requires a model
to generate a more informative caption from a news image and its associated
news article. Multimodal large language models (MLLMs) have developed rapidly
in recent years and are promising for news image captioning. However,
according to our experiments, common MLLMs are not good at generating entities
in the zero-shot setting. Their ability to handle entity information remains
limited even after simple fine-tuning on a news image captioning dataset. To
obtain a more powerful model for handling multimodal entity information, we
design two multimodal entity-aware alignment tasks and an alignment framework
to align the model and generate news image captions. Our method achieves
better results than previous state-of-the-art models in CIDEr score on the
GoodNews dataset (72.33 -> 86.29) and on the NYTimes800k dataset
(70.83 -> 85.61).

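The news image captioning task is a variant of the image captioning task that requires a model to generate captions that are more relevant to the news image and its associated news article. Multimodal large language models have developed rapidly in recent years and show good prospects for news image captioning. However, according to our experiments, common multimodal large language models are still very limited in their ability to generate entities in the zero-shot setting. After fine-tuning only on a news image captioning dataset, their ability to handle entity information remains insufficient. To obtain a more powerful model for handling multimodal entity information, we design two multimodal entity-aware alignment tasks and an alignment framework to align the model and generate news image captions. Our method achieves better results than previous state-of-the-art models in CIDEr score on the GoodNews dataset (72.33 -> 86.29) and on the NYTimes800k dataset (70.83 -> 85.61).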

Entity-Aware Multimodal Alignment Framework for News Image Captioning

This notebook paper presents our model for the VATEX video captioning
challenge. To capture multi-level aspects of the video, we propose integrating
both temporal and spatial attention for video captioning. The temporal
attentive module focuses on global action movements, while the spatial
attentive module enables the model to describe more fine-grained objects.
Since these two types of attentive modules are complementary, we fuse them via
a late fusion strategy. The proposed model significantly outperforms baselines
and achieves a CIDEr score of 73.4 on the test set, which ranks second on the
VATEX video captioning challenge leaderboard 2019.

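This paper proposes a video captioning model that applies attention along both the temporal and spatial dimensions and combines the two mechanisms through a late fusion strategy, which significantly improves captioning performance, reaching a CIDEr score of 73.4 and ranking second in the VATEX video captioning challenge.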
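
The abstract does not say which quantities are fused, so the following is only a minimal sketch of one common form of late fusion, assuming that each branch (temporal and spatial) produces its own next-word scores and that the two resulting probability distributions are averaged at each decoding step. The function names, the `weight` parameter, the toy vocabulary, and the logit values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(temporal_logits, spatial_logits, weight=0.5):
    """Fuse the next-word distributions of two captioning branches.

    Assumption: fusion is a weighted average of per-branch word
    probabilities; `weight` is the share given to the temporal branch.
    """
    p_temporal = softmax(temporal_logits)
    p_spatial = softmax(spatial_logits)
    return [
        weight * pt + (1.0 - weight) * ps
        for pt, ps in zip(p_temporal, p_spatial)
    ]


# Toy example: hypothetical 3-word vocabulary and made-up branch scores.
vocab = ["run", "dog", "ball"]
temporal_logits = [2.0, 0.5, 0.1]  # temporal branch favors the action word
spatial_logits = [0.2, 1.5, 1.4]   # spatial branch favors object words

fused = late_fusion(temporal_logits, spatial_logits)
best_word = vocab[max(range(len(vocab)), key=fused.__getitem__)]
```

Because each branch is normalized separately before mixing, the fused scores still form a valid probability distribution, and the two branches can be trained independently. Other late fusion variants, such as averaging logits or reranking complete captions, would fit the description in the abstract equally well.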