In this paper, we study multimodal coreference resolution, specifically where
a longer descriptive text, i.e., a narration, is paired with an image. This
poses significant challenges due to fine-grained image-text alignment, the
inherent ambiguity of narrative language, and the unavailability of large
annotated training sets. To tackle these challenges, we present a data-efficient
semi-supervised approach that utilizes image-narration pairs to resolve
coreferences and perform narrative grounding in a multimodal context. Our approach
incorporates losses for both labeled and unlabeled data within a cross-modal
framework. Our evaluation shows that the proposed approach outperforms strong
baselines both quantitatively and qualitatively on the tasks of coreference
resolution and narrative grounding.
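
The idea of combining losses for labeled and unlabeled data can be illustrated with a minimal sketch. The abstract does not name the individual loss terms, so everything here is an assumption: cross-entropy stands in for the labeled term, entropy minimization stands in for the unlabeled term, and the names `semi_supervised_loss` and `unlabeled_weight` are hypothetical.

```python
import math


def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold index under a probability vector."""
    return -math.log(max(probs[gold_index], 1e-12))


def entropy(probs):
    """Shannon entropy of a probability vector (assumed unlabeled objective)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def semi_supervised_loss(labeled, unlabeled, unlabeled_weight=0.5):
    """Weighted sum of a labeled term and an unlabeled term.

    labeled:   list of (probs, gold_index) pairs, e.g. mention-to-region scores
    unlabeled: list of probs without annotations
    unlabeled_weight: hypothetical trade-off hyperparameter
    """
    supervised = sum(cross_entropy(p, y) for p, y in labeled) / max(len(labeled), 1)
    unsupervised = sum(entropy(p) for p in unlabeled) / max(len(unlabeled), 1)
    return supervised + unlabeled_weight * unsupervised
```

With confident, correct predictions both terms vanish; uncertain predictions on unlabeled pairs still contribute a penalty, which is how unlabeled data can shape training in a sketch of this kind.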

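This work studies multimodal coreference resolution over images paired with descriptive text. Given the challenges of fine-grained image-text alignment, the inherent ambiguity of narrative language, and the lack of large annotated datasets, it proposes a data-efficient semi-supervised method for coreference resolution and narrative grounding in a multimodal setting. The method combines losses on labeled and unlabeled data within a cross-modal framework, and experiments show that it outperforms strong baselines on both coreference resolution and narrative grounding.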

Semi-supervised multimodal coreference resolution in image narrations

In this work, we propose a novel Cycle In Cycle Generative Adversarial
Network (C$^2$GAN) for the task of keypoint-guided image generation. The
proposed C$^2$GAN is a cross-modal framework that jointly exploits keypoint
and image data in an interactive manner. C$^2$GAN contains two different types
of generators, i.e., a keypoint-oriented generator and an image-oriented
generator. The two are mutually connected in an end-to-end learnable fashion
and explicitly form three cycled sub-networks, i.e., one image generation cycle
and two keypoint generation cycles. Each cycle not only aims to reconstruct the
input domain but also produces useful output that feeds into the generation of
another cycle. In doing so, the cycles constrain
each other implicitly, which provides complementary information from the two
different modalities and brings extra supervision across cycles, thus
facilitating more robust optimization of the whole network. Extensive
experimental results on two publicly available datasets, i.e., Radboud Faces
and Market-1501, demonstrate that our approach generates more photo-realistic
images than state-of-the-art models.
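
The three cycles described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify how the cycles are wired or which distance is used, so the wiring below, the L1 distance, and the names `cycle_losses`, `gen_image`, and `gen_keypoint` are all hypothetical. Images and keypoints are represented as flat lists of floats to keep the sketch self-contained.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_losses(image_a, keypoints_a, keypoints_b, gen_image, gen_keypoint):
    """Return (image_cycle, keypoint_cycle_b, keypoint_cycle_a).

    gen_image(image, keypoints) -> image   (assumed image-oriented generator)
    gen_keypoint(image) -> keypoints       (assumed keypoint-oriented generator)
    """
    # Image cycle: move to the target keypoints, then back to the source ones.
    fake_b = gen_image(image_a, keypoints_b)
    recon_a = gen_image(fake_b, keypoints_a)
    image_cycle = l1_distance(recon_a, image_a)

    # Keypoint cycles: keypoints recovered from each generated image should
    # match the keypoints that guided its generation.
    keypoint_cycle_b = l1_distance(gen_keypoint(fake_b), keypoints_b)
    keypoint_cycle_a = l1_distance(gen_keypoint(recon_a), keypoints_a)
    return image_cycle, keypoint_cycle_b, keypoint_cycle_a
```

Because the keypoint cycles consume the outputs of the image cycle, an error in either generator shows up in more than one loss term, which is one way to read the claim that the cycles constrain each other.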

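This work proposes C$^2$GAN, a novel cycle-in-cycle generative adversarial network for keypoint-guided image generation. The image generator and the keypoint generator are interleaved within an end-to-end learnable framework and form three cycled sub-networks, which allows the model to generate more photo-realistic images.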