We present a cross-domain approach for automated measurement and context extraction based on pre-trained language models. We construct a multi-source, multi-domain corpus and train an end-to-end extraction pipeline. We then apply multi-source task-adaptive pre-training and fine-tuning to benchmark the cross-domain generalization capability of our model. Further, we conceptualize and apply a task-specific error analysis and derive insights for future work. Our results suggest that multi-source training leads to the best overall results, while single-source training yields the best results for the respective individual domain. While our setup is successful at extracting quantity values and units, more research is needed to improve the extraction of contextual entities. We make the cross-domain corpus used in this work available online.

Multi-Source (Pre-)Training for Cross-Domain Measurement, Unit and Context Extraction