Contemporary news reporting increasingly features multimedia content,
motivating research on multimedia event extraction. However, the task lacks
annotated multimodal training data, and artificially generated training data
suffers from a distribution shift relative to real-world data. In this paper, we
propose Cross-modality Augmented Multimedia Event Learning (CAMEL), which
successfully utilizes artificially generated multimodal training data and
achieves state-of-the-art performance. Conditioned on unimodal training data,
we generate multimodal training data using off-the-shelf image generators like
Stable Diffusion and image captioners like BLIP. In order to learn robust
features that are effective across domains, we devise an iterative and gradual
annealing training strategy. Extensive experiments show that CAMEL surpasses
state-of-the-art (SOTA) baselines on the M2E2 benchmark. On multimedia events
in particular, we outperform the prior SOTA by 4.2% F1 on event mention
identification and by 9.8% F1 on argument identification, which demonstrates
that CAMEL learns synergistic representations from the two modalities.
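
The cross-modal augmentation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Example` record, the `augment` function, and the two callables are hypothetical names introduced here, and the event-type labels are illustrative only. In practice `generate_image` could wrap a text-to-image model such as Stable Diffusion and `caption_image` an image captioner such as BLIP; here they are injected as plain functions so the data flow can be read without any model dependency.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Example:
    """One training example; either modality may be missing (hypothetical schema)."""
    text: Optional[str]
    image: Optional[str]      # path or identifier of an image
    label: str                # event type annotation
    synthetic: bool = False   # True if one modality was generated


def augment(
    examples: Iterable[Example],
    generate_image: Callable[[str], str],
    caption_image: Callable[[str], str],
) -> List[Example]:
    """Fill in the missing modality of each unimodal example.

    Text-only examples get a generated image; image-only examples get a
    generated caption. Examples that already have both are passed through.
    """
    paired: List[Example] = []
    for ex in examples:
        if ex.text is not None and ex.image is None:
            paired.append(Example(ex.text, generate_image(ex.text), ex.label, True))
        elif ex.image is not None and ex.text is None:
            paired.append(Example(caption_image(ex.image), ex.image, ex.label, True))
        else:
            paired.append(ex)
    return paired


# Usage with stand-in generators (assumptions, not real model calls).
data = [
    Example("Protesters marched downtown.", None, "Conflict:Demonstrate"),
    Example(None, "img_001.jpg", "Conflict:Attack"),
]
paired = augment(data, lambda t: "gen:" + t, lambda i: "caption of " + i)
```

The `synthetic` flag is kept so that a later training stage can treat generated and real data differently, which is one plausible way to address the distribution shift the abstract mentions; the abstract does not give the annealing schedule itself, so it is not sketched here.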

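This paper proposes Cross-modality Augmented Multimedia Event Learning (CAMEL), a method that uses artificially generated multimodal training data, achieves state-of-the-art performance, and outperforms prior work on multimedia event extraction.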

Training Multimedia Event Extraction With Generated Images and Captions

Visual and textual modalities contribute complementary information about
events described in multimedia documents. Videos contain rich dynamics and
detailed unfoldings of events, while text describes more high-level and
abstract concepts. However, existing event extraction methods either do not
handle video or solely target video while ignoring other modalities. In
contrast, we propose the first approach to jointly extract events from video
and text articles. We introduce the new task of Video MultiMedia Event
Extraction (Video M2E2) and propose two novel components to build the first
system for this task. First, we propose the first self-supervised
multimodal event coreference model that can determine coreference between video
events and text events without any manually annotated pairs. Second, we
introduce the first multimodal transformer which extracts structured event
information jointly from both videos and text documents. We also construct and
will publicly release a new benchmark of 860 video-article pairs with extensive
annotations for evaluating methods on this task. Our experimental results
demonstrate the effectiveness of our proposed method on our new benchmark
dataset. We achieve 6.0% and 5.8% absolute F-score gains on multimodal event
coreference resolution and multimedia event extraction, respectively.

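This paper introduces a new task, Video MultiMedia Event Extraction (Video M2E2), along with two novel components used to build the first system for it. The method extracts structured event information from videos and text documents, and a new benchmark of 860 video-article pairs will be publicly released. Experimental results demonstrate the method's effectiveness on the new benchmark dataset.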

Joint Multimedia Event Extraction from Video and Article

We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to
extract events and their arguments from multimedia documents. We develop the
first benchmark and collect a dataset of 245 multimedia news articles with
extensively annotated events and arguments. We propose a novel method, Weakly
Aligned Structured Embedding (WASE), that encodes structured representations of
semantic information from textual and visual data into a common embedding
space. The structures are aligned across modalities by employing a weakly
supervised training strategy, which enables exploiting available resources
without explicit cross-media annotation. Compared to uni-modal state-of-the-art
methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text
event argument role labeling and visual event extraction, respectively. Compared to
state-of-the-art multimedia unstructured representations, we achieve 8.3% and
5.0% absolute F-score gains on multimedia event extraction and argument role
labeling, respectively. By utilizing images, we extract 21.4% more event
mentions than traditional text-only methods.

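This work proposes a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. It builds a benchmark and dataset for multimedia event extraction and proposes a new method, WASE, trained with a weakly supervised strategy, that encodes semantic information from textual and visual data into a common embedding space and achieves good results.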