We explore how continued pre-training on domain-specific corpora influences
large language models, revealing that training on the raw corpora endows the
model with domain knowledge, but drastically hurts its prompting ability for
question answering. Taking inspiration from human learning via reading
comprehension (practice after reading improves the ability to answer questions
based on the learned knowledge), we propose a simple method for transforming raw
corpora into reading comprehension texts. Each raw text is enriched with a
series of tasks related to its content. Our method, highly scalable and
applicable to any pre-training corpus, consistently enhances performance
across various tasks in three different domains: biomedicine, finance, and law.
Notably, our 7B language model achieves competitive performance with
domain-specific models of much larger scales, such as BloombergGPT-50B.
Furthermore, we demonstrate that domain-specific reading comprehension texts
can improve the model's performance even on general benchmarks, showing the
potential to develop a general model across even more domains. Our model, code,
and data will be available at this https URL
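
The transformation described above (a raw text followed by tasks about its
content) can be sketched as follows. This is only an illustration: the abstract
does not say which task types are used or how they are produced, so the `Task`
structure, the `to_reading_comprehension` helper, the "Question:/Answer:"
layout, and the toy passage are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """One comprehension task appended after a raw text (hypothetical structure)."""
    question: str
    answer: str


def to_reading_comprehension(raw_text: str, tasks: List[Task]) -> str:
    """Append question-answer tasks to a raw text, forming one training document."""
    parts = [raw_text.strip()]
    for task in tasks:
        # Assumed layout: each task is a plain-text question followed by its answer.
        parts.append(f"Question: {task.question}\nAnswer: {task.answer}")
    return "\n\n".join(parts)


# Toy passage and hand-written tasks, used only to show the output shape.
raw = "Aspirin inhibits the enzyme cyclooxygenase, which reduces inflammation."
tasks = [
    Task("What does aspirin inhibit?", "The enzyme cyclooxygenase."),
    Task("Summarize the passage in a few words.", "Aspirin reduces inflammation."),
]
document = to_reading_comprehension(raw, tasks)
print(document)
```

The resulting document keeps the raw text first, so the model still reads the
domain content before it practices answering questions about it.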

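We study how continued pre-training on domain-specific corpora affects large
language models, and find that training on the raw corpora gives the model
domain knowledge but severely harms its prompting ability for question
answering. Drawing on the way humans learn through reading comprehension, we
propose a simple method that converts raw corpora into reading comprehension
texts: each raw text is enriched with a series of tasks related to its content.
The method is highly scalable, applies to any pre-training corpus, and
consistently improves performance across tasks in three domains: biomedicine,
finance, and law. Notably, our 7B language model is competitive with much
larger domain-specific models such as BloombergGPT-50B. We also show that
domain-specific reading comprehension texts can improve performance even on
general benchmarks, which suggests the potential to develop a general model
across more domains. Our model, code, and data will be available at this
https URL.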

Adapting Large Language Models via Reading Comprehension

Large pre-trained language models (PLMs) have shown remarkable performance
across various natural language understanding (NLU) tasks, particularly in
low-resource settings. Nevertheless, their potential in Automatic Speech
Recognition (ASR) remains largely unexplored. This study investigates the
potential use of PLMs for language modelling in ASR. We compare large-scale
text sampling and probability conversion as ways of approximating GPT-2 with
an n-gram model. Furthermore, we introduce a
vocabulary-restricted decoding method for random sampling, and evaluate the
effects of domain difficulty and data size on the usability of generated text.
Our findings across eight domain-specific corpora support the use of
sampling-based approximation and show that interpolating with a large sampled
corpus improves test perplexity over a baseline trigram by 15%. Our
vocabulary-restricted decoding method pushes this improvement further by 5% in
domain-specific settings.
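
One plausible reading of vocabulary-restricted decoding for random sampling is
sketched below: tokens outside a fixed vocabulary are dropped before the
distribution is renormalized and sampled. The abstract does not describe the
authors' procedure, so the `restricted_sample` helper, the toy scores, and the
allowed vocabulary are assumptions for illustration only.

```python
import math
import random
from typing import Dict, Set


def restricted_sample(logits: Dict[str, float], allowed: Set[str],
                      rng: random.Random) -> str:
    """Sample one token after removing every token outside `allowed`."""
    kept = {tok: score for tok, score in logits.items() if tok in allowed}
    if not kept:
        raise ValueError("no allowed token has a score")
    # Softmax over the kept tokens only, shifted by the max for numerical stability.
    top = max(kept.values())
    weights = {tok: math.exp(score - top) for tok, score in kept.items()}
    total = sum(weights.values())
    tokens = list(weights)
    probs = [weights[tok] / total for tok in tokens]
    return rng.choices(tokens, weights=probs, k=1)[0]


# Toy next-token scores; a real system would take these from the language model.
logits = {"the": 2.0, "patient": 1.5, "xylograph": 0.5, "dose": 1.0}
allowed = {"the", "patient", "dose"}  # hypothetical ASR vocabulary
rng = random.Random(0)
samples = [restricted_sample(logits, allowed, rng) for _ in range(20)]
print(samples)
```

Because the out-of-vocabulary token is removed before normalization, every
sampled token can be counted by an n-gram model built over the same vocabulary.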

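This study investigates the potential use of pre-trained language models in
automatic speech recognition, comparing large-scale text sampling with
probability conversion. Results on eight domain-specific corpora support the
sampling-based approximation: interpolating with a large sampled corpus
improves test perplexity over a baseline trigram by 15%, and the
vocabulary-restricted decoding method we introduce improves it by a further 5%
in domain-specific settings.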

On the N-gram Approximation of Pre-trained Language Models

Accurate terminology translation is crucial for ensuring the practicality and
reliability of neural machine translation (NMT) systems. To address this,
lexically constrained NMT explores various methods to ensure pre-specified
words and phrases appear in the translation output. However, in many cases,
those methods are studied on general domain corpora, where the terms are mostly
uni- and bi-grams (>98%). In this paper, we instead tackle a more challenging
setup consisting of domain-specific corpora with much longer n-grams and highly
specialized terms. Inspired by the recent success of masked span prediction
models, we propose a simple and effective training strategy that achieves
consistent improvements on both terminology and sentence-level translation for
three domain-specific corpora in two language pairs.
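
For readers unfamiliar with masked span prediction, the sketch below shows
generic span masking over a multi-word term: the span is replaced by a sentinel
in the input, and the target asks the model to reproduce it. The abstract does
not describe how the training strategy uses this idea, so the `mask_spans`
helper, the `<mask_i>` sentinel format, and the example sentence are
assumptions, not the authors' method.

```python
from typing import List, Tuple


def mask_spans(tokens: List[str],
               spans: List[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel; return (masked input, target)."""
    masked: List[str] = []
    target: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<mask_{i}>"  # hypothetical sentinel token format
        masked.extend(tokens[cursor:start])
        masked.append(sentinel)
        # The target lists each sentinel followed by the tokens it hides.
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    masked.extend(tokens[cursor:])
    return masked, target


# Toy sentence with a three-token term at positions 1..3.
tokens = "the acute myocardial infarction was treated promptly".split()
masked, target = mask_spans(tokens, [(1, 4)])
print(masked)
print(target)
```

Masking the whole term as one span, rather than token by token, is what makes
the example relevant to long n-gram terminology.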

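This paper proposes a simple and effective training strategy, inspired by
masked span prediction models, that achieves consistent improvements in both
terminology-level and sentence-level translation on three domain-specific
corpora in two language pairs, addressing the practicality and reliability of
terminology translation in neural machine translation systems.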