With the advancement of multimedia technologies, news documents and
user-generated content are often represented as multiple modalities, making
Multimedia Event Extraction (MEE) an increasingly important challenge. However,
recent MEE methods employ weak alignment strategies and data augmentation with
simple classification models, which ignore the capabilities of natural
language-formulated event templates for the challenging Event Argument
Extraction (EAE) task. In this work, we focus on EAE and address this issue by
introducing a unified template filling model that connects the textual and
visual modalities via textual prompts. This approach enables the exploitation
of cross-ontology transfer and the incorporation of event-specific semantics.
Experiments on the M2E2 benchmark demonstrate the effectiveness of our
approach. Our system surpasses the current SOTA on textual EAE by 7% F1 and
generally outperforms the second-best systems on multimedia EAE.

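By introducing a unified template filling model, our method connects the textual and visual modalities and, via textual prompts, enables cross-ontology transfer and the incorporation of event-specific semantics. Experiments on the M2E2 benchmark demonstrate the effectiveness of our method: our system surpasses the current best method on textual EAE by 7% F1 and generally performs better on multimedia EAE.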

MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling

In recent years, multi-modal machine translation has attracted significant
interest in both academia and industry due to its superior performance. It
takes both textual and visual modalities as inputs, leveraging visual context
to tackle the ambiguities in source texts. In this paper, we begin by offering
an exhaustive overview of 99 prior works, comprehensively summarizing
representative studies from the perspectives of dominant models, datasets, and
evaluation metrics. Afterwards, we analyze the impact of various factors on
model performance and finally discuss the possible research directions for this
task in the future. Over time, multi-modal machine translation has diversified
into more task types to meet diverse needs. Unlike previous surveys confined to
the early stage of multi-modal machine translation, our survey thoroughly
summarizes these emerging types from different aspects, so as to provide
researchers with a better understanding of its current state.

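Multi-modal machine translation is a research area that has attracted wide attention from academia and industry in recent years. This paper surveys 99 prior works, comprehensively summarizing the dominant models, datasets, and evaluation metrics, analyzing the impact of various factors on model performance, and discussing future research directions for the field. Unlike previous surveys confined to early-stage multi-modal machine translation, our survey summarizes the emerging task types in depth from different perspectives, so as to give researchers a better understanding of the current state of research.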

A Survey on Multi-modal Machine Translation: Tasks, Methods and Challenges

Stance detection is a challenging task that aims to identify public opinion
from social media platforms with respect to specific targets. Previous work on
stance detection has largely focused on pure text. In this paper, we study
multi-modal stance detection for tweets consisting of texts and images, which
are prevalent in today's fast-growing social media platforms where people often
post multi-modal messages. To this end, we create five new multi-modal stance
detection datasets of different domains based on Twitter, in which each example
consists of a text and an image. In addition, we propose a simple yet effective
Targeted Multi-modal Prompt Tuning framework (TMPT), where target information
is leveraged to learn multi-modal stance features from textual and visual
modalities. Experimental results on our three benchmark datasets show that the
proposed TMPT achieves state-of-the-art performance in multi-modal stance
detection.

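By integrating multi-modal information from text and images, the paper proposes a simple yet effective TMPT framework that leverages target information to learn multi-modal stance features from the textual and visual modalities, achieving state-of-the-art multi-modal stance detection performance on three benchmark datasets.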