Recently, various multimodal networks for Visually-Rich Document
Understanding (VRDU) have been proposed, showing that transformers improve
when visual and layout information is integrated with the text embeddings. However,
most existing approaches utilize the position embeddings to incorporate the
sequence information, neglecting the noisy improper reading order obtained by
OCR tools. In this paper, we propose a robust layout-aware multimodal network
named XYLayoutLM to capture and leverage rich layout information from proper
reading orders produced by our Augmented XY Cut. Moreover, a Dilated
Conditional Position Encoding module is proposed to deal with the input
sequence of variable lengths, and it additionally extracts local layout
information from both textual and visual modalities while generating position
embeddings. Experimental results show that our XYLayoutLM achieves competitive
results on document understanding tasks.
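
The abstract does not describe how Augmented XY Cut works internally. As a rough illustration of the general idea it builds on, the following is a minimal sketch of a classic recursive XY cut that orders OCR boxes by repeatedly splitting the page at the widest empty gap along the x or y axis. The function names, the `(x0, y0, x1, y1)` box format with y growing downward, and the widest-gap heuristic are assumptions made for this example; it is not the authors' augmented variant.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x0, y0, x1, y1), image coordinates, y grows downward.
Box = Tuple[float, float, float, float]


def _widest_gap_split(indices: List[int], boxes: List[Box], lo: int, hi: int):
    """Split `indices` at the widest empty gap along one axis.

    `lo`/`hi` select the axis: (0, 2) for x, (1, 3) for y.
    Returns (gap_width, first_part, second_part); the gap is 0.0 and the
    second part is empty when the projections leave no empty gap.
    """
    order = sorted(indices, key=lambda i: boxes[i][lo])
    best_gap, best_pos = 0.0, None
    reach = boxes[order[0]][hi]  # furthest extent covered so far
    for pos in range(1, len(order)):
        start = boxes[order[pos]][lo]
        if start - reach > best_gap:
            best_gap, best_pos = start - reach, pos
        reach = max(reach, boxes[order[pos]][hi])
    if best_pos is None:
        return 0.0, order, []
    return best_gap, order[:best_pos], order[best_pos:]


def xy_cut_order(boxes: List[Box], indices: Optional[List[int]] = None) -> List[int]:
    """Return box indices in an estimated reading order."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    x_gap, left, right = _widest_gap_split(indices, boxes, 0, 2)
    y_gap, top, bottom = _widest_gap_split(indices, boxes, 1, 3)
    if x_gap == 0 and y_gap == 0:
        # No clean cut exists: fall back to top-to-bottom, left-to-right.
        return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))
    if x_gap > y_gap:
        first, second = left, right   # vertical cut: left block, then right block
    else:
        first, second = top, bottom   # horizontal cut: upper block, then lower block
    return xy_cut_order(boxes, first) + xy_cut_order(boxes, second)


# A header spanning the page, followed by two columns of two lines each.
page = [
    (0, 0, 100, 10),    # 0: header
    (0, 20, 45, 30),    # 1: left column, line 1
    (55, 20, 100, 30),  # 2: right column, line 1
    (0, 32, 45, 42),    # 3: left column, line 2
    (55, 32, 100, 42),  # 4: right column, line 2
]
reading_order = xy_cut_order(page)  # header, left column, then right column
```

A plain top-to-bottom, left-to-right sort would interleave the two columns line by line, which is the kind of improper reading order the abstract refers to; the recursive cut keeps each column together.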

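This paper proposes XYLayoutLM, a robust layout-aware multimodal network that captures and leverages rich layout information from the proper reading orders produced by Augmented XY Cut. It also proposes a Dilated Conditional Position Encoding module that handles variable-length input sequences while extracting local layout information from both the textual and visual modalities to generate position embeddings. The model achieves competitive results on document understanding tasks.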

XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding

Automatically generating a natural language sentence to describe the content
of an input video is a very challenging problem. It is an essential multimodal
task in which auditory and visual contents are equally important. Although
audio information has been exploited to improve video captioning in previous
works, it is usually regarded as an additional feature fed into a black box
fusion machine. How are the words in the generated sentences associated with
the auditory and visual modalities? This question has not yet been investigated. In
this paper, we make the first attempt to design an interpretable audio-visual
video captioning network to discover the association between words in sentences
and audio-visual sequences. To achieve this, we propose a multimodal
convolutional neural network-based audio-visual video captioning framework and
introduce a modality-aware module for exploring modality selection during
sentence generation. Besides, we collect new audio captioning and visual
captioning datasets for further exploring the interactions between auditory and
visual modalities for high-level video understanding. Extensive experiments
demonstrate that the modality-aware module makes our model interpretable on
modality selection during sentence generation. Even with the added
interpretability, our video captioning network still achieves performance
comparable to recent state-of-the-art methods.

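This paper introduces a multimodal convolutional neural network framework for video captioning. By adding a modality-aware module, it explores how audio-visual interactions affect video understanding and shows that the model is interpretable in its modality selection while still achieving comparable performance.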

An Attempt towards Interpretable Audio-Visual Video Captioning

Birds in view twitter, moving cars are accompanied by noise, and so on. These
natural audiovisual correspondences make it possible to explore and
understand the outside world. However, the mixture of multiple objects and
sounds makes efficient matching intractable in unconstrained environments.
To address this problem, we propose to thoroughly extract audio and visual
components and perform fine-grained correspondence learning among them.
Concretely, we propose a novel unsupervised audiovisual learning model, named
Deep Multimodal Clustering (DMC), that synchronously performs sets of
clustering with multimodal vectors of convolutional maps in different shared
spaces to capture multiple audiovisual correspondences. This integrated
multimodal clustering network can be effectively trained with a max-margin loss
in an end-to-end fashion. Extensive experiments on feature evaluation and
audiovisual tasks are performed. The results demonstrate that DMC can learn
effective unimodal representations, with which a classifier can even
outperform humans. Further, DMC shows notable performance in
sound localization, multisource detection, and audiovisual understanding.
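
The abstract names a max-margin loss but does not give its exact form. As a hedged illustration only, the sketch below shows the generic hinge-style max-margin objective over matched and mismatched audio-visual pairs: a matched pair should score higher than a mismatched one by at least a margin. The function name, the use of similarity scores, and the default margin are assumptions for this example and are not taken from the DMC paper.

```python
from typing import Sequence


def max_margin_loss(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Mean hinge loss max(0, margin - s_pos + s_neg) over paired scores.

    `pos_scores[i]` is the (assumed) similarity of a matched audio-visual
    pair and `neg_scores[i]` the similarity of a mismatched pair.
    """
    if len(pos_scores) != len(neg_scores) or len(pos_scores) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    losses = [max(0.0, margin - p + n) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# The matched pair already beats the mismatched pair by more than the margin.
well_separated = max_margin_loss([2.0], [0.5])
# The matched pair wins by only 0.25, so the loss penalizes the shortfall.
too_close = max_margin_loss([0.5], [0.25])
```

The loss is zero once every matched pair outscores its mismatched counterpart by the margin, so training pressure concentrates on pairs that are still confusable.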

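This paper proposes Deep Multimodal Clustering, an unsupervised audiovisual learning model that performs sets of clustering over multimodal vectors of convolutional maps in different shared spaces to capture multiple audiovisual correspondences and learn fine-grained correspondences, and that is trained effectively with a max-margin loss. Experiments show that the model learns effective unimodal representations and performs notably well in sound localization, multisource detection, and audiovisual understanding.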