The use of visually-rich documents (VRDs) in various fields has created a
demand for Document AI models that can read and comprehend documents like
humans, which requires overcoming technical, linguistic, and cognitive
barriers. Unfortunately, the lack of appropriate datasets has significantly
hindered advancements in the field. To address this issue, we introduce
\textsc{DocTrack}, a VRD dataset really aligned with human eye-movement
information using eye-tracking technology. This dataset can be used to
investigate the challenges mentioned above. Additionally, we explore the impact
of human reading order on document understanding tasks and examine what
happens when a machine reads in the same order as a human. Our results suggest
that although Document AI models have made significant progress, they still
have a long way to go before they can read VRDs as accurately, continuously,
and flexibly as humans do. These findings have potential implications for
future research and development of Document AI models. The data is available at
https://github.com/hint-lab/doctrack.
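
To make the idea of "reading in the same order as a human" concrete, the following minimal sketch contrasts a plain raster order (top to bottom, left to right) with an order driven by eye-tracking data. The `Token` class, its `fixation_rank` field, and the sample layout are assumptions made for illustration; they are not the DocTrack data format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: str
    x: float  # left edge of the token's bounding box
    y: float  # top edge of the token's bounding box
    # Hypothetical field: position of the token in the human fixation sequence.
    fixation_rank: Optional[int] = None


def raster_order(tokens: List[Token], line_height: float = 10.0) -> List[Token]:
    """Sort top-to-bottom, then left-to-right, bucketing tokens into lines."""
    return sorted(tokens, key=lambda t: (int(t.y // line_height), t.x))


def human_order(tokens: List[Token]) -> List[Token]:
    """Sort by fixation rank; tokens never fixated fall back to raster order."""
    fixated = sorted(
        (t for t in tokens if t.fixation_rank is not None),
        key=lambda t: t.fixation_rank,
    )
    skipped = [t for t in raster_order(tokens) if t.fixation_rank is None]
    return fixated + skipped


# A made-up two-column form: a human reads the left column before the right one.
tokens = [
    Token("Name", 10, 20, 0),
    Token("Date", 200, 20, 2),
    Token("Alice", 10, 40, 1),
    Token("Jan 1", 200, 40, 3),
]
raster_texts = [t.text for t in raster_order(tokens)]
human_texts = [t.text for t in human_order(tokens)]
print(raster_texts)
print(human_texts)
```

In this toy layout the raster order interleaves the two columns, while the fixation-based order keeps each field next to its value, which is the kind of difference a reading-order-aware model input would capture.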

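DocTrack is a VRD dataset aligned with human eye-movement information through eye-tracking technology. The work studies how human reading order affects document understanding tasks. The results show that, although Document AI models have made significant progress, they still have a long way to go before they read VRDs as accurately, continuously, and flexibly as humans, with potential implications for future Document AI research and development.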

DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading

NeSy4VRD is a multifaceted resource designed to support the development of
neurosymbolic AI (NeSy) research. NeSy4VRD re-establishes public access to the
images of the VRD dataset and couples them with an extensively revised,
quality-improved version of the VRD visual relationship annotations. Crucially,
NeSy4VRD provides a well-aligned, companion OWL ontology that describes the
dataset domain. It comes with open source infrastructure that provides
comprehensive support for extensibility of the annotations (which, in turn,
facilitates extensibility of the ontology), and open source code for loading
the annotations to/from a knowledge graph. We are contributing NeSy4VRD to the
computer vision, NeSy and Semantic Web communities to help foster more NeSy
research using OWL-based knowledge graphs.
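
As a rough illustration of what loading a visual relationship annotation into a knowledge graph can involve, the sketch below turns one annotation into N-Triples lines using only the Python standard library. The namespace `http://example.org/nesy4vrd#`, the annotation dictionary layout, and the naming scheme for individuals are all placeholders chosen for this example; they are not the actual NeSy4VRD ontology IRI, annotation format, or loading code.

```python
from typing import Dict, List

# Placeholder namespace for illustration only, not the real ontology IRI.
BASE = "http://example.org/nesy4vrd#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def local_name(label: str, upper_first: bool) -> str:
    """Convert a label such as 'sit on' into camel case ('sitOn' or 'SitOn')."""
    words = label.split()
    camel = words[0].lower() + "".join(w.capitalize() for w in words[1:])
    return camel[0].upper() + camel[1:] if upper_first else camel


def iri(name: str) -> str:
    return f"<{BASE}{name}>"


def annotation_to_ntriples(image_id: str, index: int, ann: Dict[str, str]) -> List[str]:
    """Emit two class-membership triples and one relationship triple."""
    subj = f"{image_id}_r{index}_s"  # hypothetical naming of the subject individual
    obj = f"{image_id}_r{index}_o"   # hypothetical naming of the object individual
    return [
        f"{iri(subj)} {RDF_TYPE} {iri(local_name(ann['subject'], True))} .",
        f"{iri(obj)} {RDF_TYPE} {iri(local_name(ann['object'], True))} .",
        f"{iri(subj)} {iri(local_name(ann['predicate'], False))} {iri(obj)} .",
    ]


ann = {"subject": "person", "predicate": "sit on", "object": "bench"}
lines = annotation_to_ntriples("img1", 0, ann)
print("\n".join(lines))
```

Object classes map to capitalized class names and predicates to lower camel case properties, a common OWL naming habit that is assumed here rather than taken from the resource itself.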

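NeSy4VRD is a multifaceted resource designed to support the development of neurosymbolic AI (NeSy) research. It includes the images of the VRD dataset together with an extensively revised, quality-improved version of the VRD visual relationship annotations. Importantly, NeSy4VRD provides an OWL ontology that describes the dataset domain, as well as open source infrastructure for loading the annotations into a knowledge graph. We are contributing NeSy4VRD to the computer vision, NeSy, and Semantic Web communities to help foster more NeSy research using OWL-based knowledge graphs.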

NeSy4VRD: A Multifaceted Resource for Neurosymbolic AI Research using Knowledge Graphs in Visual Relationship Detection

Detecting visual relationships, i.e. <Subject, Predicate, Object> triplets,
is a challenging Scene Understanding task approached in the past via linguistic
priors or spatial information in a single feature branch. We introduce a new
deeply supervised two-branch architecture, the Multimodal Attentional
Translation Embeddings, where the visual features of each branch are driven by
a multimodal attentional mechanism that exploits spatio-linguistic similarities
in a low-dimensional space. We present a variety of experiments comparing
against all related approaches in the literature, several of which we
re-implement and fine-tune. Results on the commonly employed VRD dataset
[1] show that the proposed method clearly outperforms all others, and we
support our claims both quantitatively and qualitatively.
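
For readers unfamiliar with translation embeddings, the sketch below shows the general idea in its simplest form: a predicate is modeled as a translation between subject and object embeddings in a low-dimensional space, so candidate predicates can be ranked by how close they are to `object - subject`. This is a generic illustration with hand-picked 2-D vectors; it does not reproduce the two-branch architecture, the attention mechanism, or the deep supervision described in the abstract.

```python
from typing import Dict, List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_predicates(
    subject: Sequence[float],
    obj: Sequence[float],
    predicates: Dict[str, Sequence[float]],
) -> List[str]:
    """Rank predicate names by closeness of their embedding to (obj - subject)."""
    translation = [o - s for s, o in zip(subject, obj)]
    return sorted(
        predicates,
        key=lambda name: squared_distance(predicates[name], translation),
    )


# Toy embeddings chosen by hand; real models learn these from data.
subject_vec = [0.0, 0.0]
object_vec = [1.0, 0.0]
predicate_vecs = {
    "on": [0.0, 1.0],
    "next to": [1.0, 0.0],
    "under": [0.0, -1.0],
}
ranking = rank_predicates(subject_vec, object_vec, predicate_vecs)
print(ranking[0])
```

Here "next to" ranks first because its embedding coincides with the subject-to-object translation; in a trained model the embeddings would be projections of visual, spatial, and linguistic features rather than fixed vectors.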

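This paper proposes a new deep-learning architecture, Multimodal Attentional Translation Embeddings, in which a multimodal attention mechanism drives the visual features of each branch. Experiments on the commonly used VRD dataset show that the method clearly outperforms other related approaches.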