We introduce "pointer-guided segment ordering" (SO), a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations in large language models. Our methodology leverages a self-attention-driven pointer network to restore the original sequence of shuffled text segments, addressing the challenge of capturing the structural coherence and contextual dependencies within documents. This pre-training approach is complemented by a fine-tuning methodology that incorporates dynamic sampling, augmenting the diversity of training instances and improving sample efficiency for various downstream applications. We evaluate our method on a diverse set of datasets, demonstrating its efficacy in tasks requiring sequential text classification across scientific literature and financial reporting domains. Our experiments show that pointer-guided pre-training significantly enhances the model's ability to understand complex document structures, leading to state-of-the-art performance in downstream classification tasks.
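
As an illustration of the segment-ordering objective described above, the following is a minimal sketch, not the implementation evaluated in this work. It assumes each segment has already been encoded into a fixed-size vector; the function names (`make_ordering_example`, `pointer_distributions`, `ordering_loss`) and the toy one-hot embeddings are hypothetical and chosen only for illustration. The sketch shuffles a document's segments, computes scaled dot-product attention weights that act as a pointer from each original position to a shuffled segment, and scores the result with a cross-entropy loss.

```python
import math
import random


def make_ordering_example(segments, rng):
    """Shuffle segments and return (shuffled, targets).

    targets[i] is the index in `shuffled` of the segment that originally
    stood at position i, so [shuffled[t] for t in targets] == segments.
    """
    order = list(range(len(segments)))
    rng.shuffle(order)  # order[j] = original index of shuffled[j]
    shuffled = [segments[k] for k in order]
    targets = [order.index(i) for i in range(len(segments))]
    return shuffled, targets


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_distributions(queries, keys):
    """Scaled dot-product attention weights used as pointers.

    Row i is a probability distribution over the shuffled segments,
    read as "which shuffled segment belongs at original position i".
    """
    d = len(keys[0])
    return [
        softmax([sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in keys])
        for qi in queries
    ]


def ordering_loss(probs, targets):
    """Mean cross-entropy between pointer distributions and true targets."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


# Toy demo (assumption: one-hot vectors stand in for learned embeddings).
segments = ["Background", "Methods", "Results", "Conclusion"]
shuffled, targets = make_ordering_example(segments, random.Random(0))
n = len(segments)
keys = [[4.0 if segments.index(s) == c else 0.0 for c in range(n)] for s in shuffled]
queries = [[4.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
probs = pointer_distributions(queries, keys)
print(ordering_loss(probs, targets))
```

In an actual model, the queries and keys would be learned projections of the segment representations rather than fixed one-hot vectors, and the loss would be minimized jointly with the encoder during pre-training.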


Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness