A major obstacle in document analysis research is that documents tend to
be copyrighted or contain private information, which prohibits their open
publication and the creation of centralised, large-scale document datasets.
Instead, documents are scattered in private data silos, making extensive
training over heterogeneous data a tedious task. In this work, we explore the
use of a federated learning (FL) scheme as a way to train a shared model on
decentralised private document data. We focus on the problem of Document VQA, a
task particularly suited to this approach, as the type of reasoning
capabilities required from the model can be quite different in diverse domains.
Enabling training over heterogeneous document datasets can thus substantially
enrich DocVQA models. We assemble existing DocVQA datasets from diverse domains
to reflect the data heterogeneity in real-world applications. We explore the
self-pretraining technique in this multi-modal setting, where the same data is
used for both pretraining and finetuning, making it relevant for privacy
preservation. We further propose combining self-pretraining with a Federated
DocVQA training method using centralized adaptive optimization that outperforms
the FedAvg baseline. With extensive experiments, we also present a
multi-faceted analysis on training DocVQA models with FL, which provides
insights for future research on this task. We show that our pretraining
strategies can effectively learn and scale up under federated training with
diverse DocVQA datasets, and that hyperparameter tuning is essential for
practical document tasks under federation.
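
The difference between the FedAvg baseline and centralized adaptive optimization can be illustrated with a minimal sketch. It assumes a FedAdam-style server update, in which the server treats the averaged client change as a pseudo-gradient and passes it through Adam. The abstract does not say which adaptive optimizer is used, so `ServerAdam`, its hyperparameters, and the flat-list model representation are illustrative assumptions, not the authors' implementation.

```python
import math


def weighted_average(client_weights, client_sizes):
    """Average client models, weighting each by its local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def fedavg_step(client_weights, client_sizes):
    """FedAvg: the new global model is simply the weighted client average."""
    return weighted_average(client_weights, client_sizes)


class ServerAdam:
    """Hypothetical server-side Adam (FedAdam-style, no bias correction)."""

    def __init__(self, dim, lr=0.1, beta1=0.9, beta2=0.99, eps=1e-3):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * dim  # first-moment estimate
        self.v = [0.0] * dim  # second-moment estimate

    def step(self, global_w, client_weights, client_sizes):
        avg = weighted_average(client_weights, client_sizes)
        # Pseudo-gradient: how far the clients moved the model on average.
        delta = [a - g for a, g in zip(avg, global_w)]
        self.m = [self.beta1 * m + (1 - self.beta1) * d
                  for m, d in zip(self.m, delta)]
        self.v = [self.beta2 * v + (1 - self.beta2) * d * d
                  for v, d in zip(self.v, delta)]
        # Per-coordinate adaptive step taken on the server.
        return [g + self.lr * m / (math.sqrt(v) + self.eps)
                for g, m, v in zip(global_w, self.m, self.v)]


# Toy round: two clients with different amounts of local data.
global_model = [0.0, 0.0]
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [1, 3]
fedavg_model = fedavg_step(clients, sizes)
adaptive_model = ServerAdam(dim=2).step(global_model, clients, sizes)
```

In this sketch FedAvg jumps straight to the client average, while the adaptive server moves the global model by a step whose size is normalised per coordinate. That is one reason the server learning rate becomes an extra hyperparameter to tune.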

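This work uses a federated learning scheme to train a shared model on decentralised private document data, so that DocVQA models can benefit from heterogeneous data across diverse domains. A federated DocVQA training method that combines self-pretraining with centralized adaptive optimization outperforms the FedAvg baseline, and extensive experiments provide a multi-faceted analysis of training DocVQA models with federated learning, offering insights for future research.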

Federated Document Visual Question Answering: A Pilot Study

For most natural language processing tasks, the dominant practice is to
finetune large pretrained transformer models (e.g., BERT) using smaller
downstream datasets. Despite the success of this approach, it remains unclear
to what extent these gains are attributable to the massive background corpora
employed for pretraining versus to the pretraining objectives themselves. This
paper introduces a large-scale study of self-pretraining, where the same
(downstream) training data is used for both pretraining and finetuning. In
experiments addressing both ELECTRA and RoBERTa models and 10 distinct
downstream datasets, we observe that self-pretraining rivals standard
pretraining on the BookWiki corpus (despite using around
$10\times$--$500\times$ less data), outperforming the latter on $7$ and $5$
datasets, respectively. Surprisingly, these task-specific pretrained models
often perform well on other tasks, including the GLUE benchmark. Our results
suggest that in many scenarios, performance gains attributable to pretraining
are driven primarily by the pretraining objective itself and are not always
attributable to the incorporation of massive datasets. These findings are
especially relevant in light of concerns about intellectual property and
offensive content in web-scale pretraining data.
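
The central idea, reusing one downstream training set for both stages, can be sketched as follows. The example assumes a simplified masked-token objective with whitespace tokenisation; the `[MASK]` symbol, the 15% masking rate, and the toy sentiment data are illustrative assumptions and do not reproduce the ELECTRA or RoBERTa pipelines studied in the paper.

```python
import random

MASK = "[MASK]"


def make_mlm_examples(texts, mask_prob=0.15, seed=0):
    """Turn raw texts into (masked_input, targets) pairs for pretraining.

    targets[i] holds the original token where the input was masked,
    and None where the token was left untouched.
    """
    rng = random.Random(seed)
    examples = []
    for text in texts:
        inputs, targets = [], []
        for tok in text.split():
            if rng.random() < mask_prob:
                inputs.append(MASK)
                targets.append(tok)
            else:
                inputs.append(tok)
                targets.append(None)
        examples.append((inputs, targets))
    return examples


# One (hypothetical) downstream dataset: text plus task label.
downstream_train = [
    ("the movie was great", 1),
    ("the plot was dull", 0),
]

# Stage 1 (self-pretraining): labels are ignored, only the text is used.
pretrain_examples = make_mlm_examples([text for text, _ in downstream_train])

# Stage 2 (finetuning): the same examples, now with their task labels.
finetune_examples = downstream_train
```

No external corpus enters either stage, which is what makes the setup relevant to the intellectual property concerns mentioned above.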

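This paper presents a large-scale study of self-pretraining, in which the same (downstream) training data is used for both pretraining and finetuning. It observes that self-pretraining rivals standard pretraining, which suggests that in many cases the performance gains from pretraining are driven primarily by the pretraining objective itself rather than by massive datasets.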

Downstream Datasets Make Surprisingly Good Pretraining Corpora

We present a neural semi-supervised learning model termed Self-Pretraining.
Our model is inspired by the classic self-training algorithm. However, unlike
self-training, Self-Pretraining is threshold-free, can potentially update its
belief about previously labeled documents, and can cope with the semantic
drift problem. Self-Pretraining is iterative and consists of
two classifiers. In each iteration, one classifier draws a random set of
unlabeled documents and labels them. This set is used to initialize the second
classifier, to be further trained by the set of labeled documents. The
algorithm proceeds to the next iteration and the classifiers' roles are
reversed. To improve the flow of information across the iterations and also to
cope with the semantic drift problem, Self-Pretraining employs an iterative
distillation process, transfers hypotheses across the iterations, utilizes a
two-stage training model, uses an efficient learning rate schedule, and employs
a pseudo-label transformation heuristic. We have evaluated our model on three
publicly available social media datasets. Our experiments show that
Self-Pretraining outperforms the existing state-of-the-art semi-supervised
classifiers across multiple settings. Our code is available at
this https URL
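
The alternating two-classifier loop described above can be sketched roughly as follows. This is a toy illustration only: it uses a hypothetical `Perceptron` in place of the neural classifiers, and it omits the iterative distillation, hypothesis transfer, learning rate schedule, and pseudo-label transformation heuristic mentioned in the abstract. All names and parameters are assumptions.

```python
import random


class Perceptron:
    """Tiny binary linear classifier standing in for a neural model."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else 0

    def fit(self, data, epochs=5):
        # Continues from the current weights, so repeated calls stack.
        for _ in range(epochs):
            for x, y in data:
                err = y - self.predict(x)
                if err:
                    self.w = [wi + err * xi for wi, xi in zip(self.w, x)]
                    self.b += err
        return self


def self_pretraining(labeled, unlabeled, dim, iterations=4,
                     sample_size=20, seed=0):
    rng = random.Random(seed)
    labeler = Perceptron(dim).fit(labeled)  # bootstrap on the labeled set
    for _ in range(iterations):
        # The labeling classifier draws a random set of unlabeled
        # documents and labels all of them (no confidence threshold).
        batch = rng.sample(unlabeled, min(sample_size, len(unlabeled)))
        pseudo = [(x, labeler.predict(x)) for x in batch]
        # Stage 1: initialise the other classifier on the pseudo-labels.
        learner = Perceptron(dim).fit(pseudo)
        # Stage 2: train it further on the labeled documents.
        learner.fit(labeled)
        # Roles are reversed for the next iteration.
        labeler = learner
    return labeler


# Toy data: class 1 when the first feature dominates the second.
labeled = [([2.0, 0.0], 1), ([0.0, 2.0], 0)]
unlabeled = [[3.0, 1.0], [1.0, 3.0]]
model = self_pretraining(labeled, unlabeled, dim=2)
```

Because every sampled document receives a fresh pseudo-label in each iteration, earlier labelling decisions can be revised later, which is the threshold-free, belief-updating behaviour the abstract contrasts with classic self-training.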

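This study proposes a neural semi-supervised learning model called Self-Pretraining. The model is threshold-free, can update its belief about previously labeled documents, and can cope with the semantic drift problem. It employs an iterative distillation process, transfers hypotheses across iterations, utilizes a two-stage training model, uses an efficient learning rate schedule, and applies a pseudo-label transformation heuristic.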