Text role classification involves classifying the semantic role of textual
elements within scientific charts. For this task, we propose to finetune two
pretrained multimodal document layout analysis models, LayoutLMv3 and UDOP, on
chart datasets. The transformers utilize the three modalities of text, image,
and layout as input. We further investigate whether data augmentation and
balancing methods improve the performance of the models. The models are evaluated
on various chart datasets, and results show that LayoutLMv3 outperforms UDOP in
all experiments. LayoutLMv3 achieves the highest F1-macro score of 82.87 on the
ICPR22 test dataset, beating the best-performing model from the ICPR22
CHART-Infographics challenge. Moreover, the robustness of the models is tested
on a synthetic noisy dataset, ICPR22-N. Finally, the generalizability of the
models is evaluated on three chart datasets, CHIME-R, DeGruyter, and EconBiz,
for which we added labels for the text roles. Findings indicate that even in
cases where there is limited training data, transformers can be used with the
help of data augmentation and balancing methods. The source code and datasets
are available on GitHub.
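
The F1-macro score cited in the abstract is the unweighted mean of the per-class
F1 scores, so rare text roles count as much as frequent ones. The following is a
minimal sketch of that computation, not the authors' evaluation code. The role
names in the usage example are hypothetical, and the function returns a value in
[0, 1]; multiply by 100 to compare with percentage-style scores.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores (range 0..1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one sample is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical text roles, for illustration only.
y_true = ["axis_title", "tick_label", "tick_label", "legend_label"]
y_pred = ["axis_title", "tick_label", "legend_label", "legend_label"]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally, a model that ignores a rare role is
penalized heavily, which is one reason balancing methods matter for this metric.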

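Text role classification involves classifying the semantic role of text elements
in scientific charts. We propose fine-tuning two pretrained multimodal document
layout analysis models, LayoutLMv3 and UDOP, on chart datasets, using the three
modalities of text, image, and layout as input. We further investigate whether
data augmentation and balancing methods improve model performance. The models
are evaluated on various chart datasets, and the results show that LayoutLMv3
outperforms UDOP in all experiments. LayoutLMv3 achieves the highest F1-macro
score of 82.87 on the ICPR22 test dataset, surpassing the best model from the
ICPR22 CHART-Infographics challenge. In addition, the robustness of the models
is tested on the synthetic noisy dataset ICPR22-N. Finally, we evaluate the
generalizability of the models on three chart datasets with text role labels:
CHIME-R, DeGruyter, and EconBiz. The findings indicate that even with limited
training data, transformers can be used with the help of data augmentation and
balancing methods. The source code and datasets are available on GitHub.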

Text Role Classification in Scientific Charts Using Multimodal Transformers

This paper presents an application of the LayoutLMv3 model for semantic table
detection on financial documents from the IIIT-AR-13K dataset. The motivation
behind this paper's experiment was that LayoutLMv3's official paper had no
results for table detection using semantic information. We concluded that our
approach did not improve the model's table detection capabilities, for which we
can give several possible reasons. Either the model's weights were unsuitable
for our purpose, or we needed to invest more time in optimising the model's
hyperparameters. It is also possible that semantic information does not improve
a model's table detection accuracy.

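This paper presents an application of the LayoutLMv3 model to semantic table
detection on financial documents from the IIIT-AR-13K dataset. The approach did
not improve the model's table detection capabilities, possibly because the
model's weights were unsuitable for the purpose, because more time was needed
for optimisation, or because semantic information does not improve a model's
table detection accuracy.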

Semantic Table Detection with LayoutLMv3

Self-supervised pre-training techniques have achieved remarkable progress in
Document AI. Most multimodal pre-trained models use a masked language modeling
objective to learn bidirectional representations on the text modality, but they
differ in pre-training objectives for the image modality. This discrepancy adds
difficulty to multimodal representation learning. In this paper, we propose
LayoutLMv3 to pre-train multimodal Transformers for Document AI with
unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a
word-patch alignment objective to learn cross-modal alignment by predicting
whether the corresponding image patch of a text word is masked. The simple
unified architecture and training objectives make LayoutLMv3 a general-purpose
pre-trained model for both text-centric and image-centric Document AI tasks.
Experimental results show that LayoutLMv3 achieves state-of-the-art performance
not only in text-centric tasks, including form understanding, receipt
understanding, and document visual question answering, but also in
image-centric tasks such as document image classification and document layout
analysis. The code and models are publicly available at
https://aka.ms/layoutlmv3.
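
The word-patch alignment objective described above needs, for every text word, a
target saying whether its corresponding image patch is masked. The sketch below
shows one way such targets could be derived. It is an illustration under stated
assumptions, not the released implementation: the 224-pixel image, the 16-pixel
patches, the pixel-coordinate box format (x0, y0, x1, y1), the rule that a word
counts as masked when any patch it overlaps is masked, and the function names
`patches_for_box` and `wpa_labels` are all assumptions made for this example.

```python
def patches_for_box(box, image_size=224, patch_size=16):
    """Return the indices of grid patches overlapped by an integer pixel box.

    Patches are numbered row by row, starting at 0 in the top-left corner.
    """
    per_row = image_size // patch_size
    x0, y0, x1, y1 = box

    def to_cell(pixel):
        # Convert a pixel coordinate to a patch row/column, clamped to the grid.
        return max(0, min(per_row - 1, pixel // patch_size))

    # x1 and y1 are treated as exclusive, hence the "- 1".
    col0, col1 = to_cell(x0), to_cell(x1 - 1)
    row0, row1 = to_cell(y0), to_cell(y1 - 1)
    return {
        row * per_row + col
        for row in range(row0, row1 + 1)
        for col in range(col0, col1 + 1)
    }


def wpa_labels(word_boxes, masked_patches):
    """Return 1 for each word whose image region touches a masked patch, else 0."""
    masked = set(masked_patches)
    return [1 if patches_for_box(box) & masked else 0 for box in word_boxes]


# Three hypothetical word boxes on a 224x224 page image; patches 2 and 30 are masked.
boxes = [(0, 0, 16, 16), (20, 5, 40, 12), (200, 200, 224, 224)]
labels = wpa_labels(boxes, masked_patches={2, 30})
```

A binary classifier on top of each text token's output representation would then
be trained against these labels, which is what ties the text and image
modalities together during pre-training.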

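LayoutLMv3, proposed in this paper, is a method for pre-training multimodal
Transformers for Document AI with unified text and image masking. It learns
cross-modal alignment by predicting whether the image patch corresponding to a
text word is masked. Experimental results show that LayoutLMv3 achieves
state-of-the-art performance not only in text-centric tasks but also in
image-centric tasks.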