Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely
used to expand the model's fundamental understanding of specific downstream
domains (e.g., math and code). For CPT of domain-specific LLMs, one
important question is how to choose the optimal mixture ratio between the
general corpus (e.g., Dolma, SlimPajama) and the downstream domain corpus.
Existing methods usually rely on laborious manual grid search over a set of
mixture ratios, which incurs high GPU training costs. Moreover, the selected
ratio is not guaranteed to be optimal for the specific domain. To address
these limitations, and inspired by the Scaling
Law for performance prediction, we propose to investigate the Scaling Law of
the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal
mixture ratio with acceptable training costs for LLMs of different sizes.
Specifically, by fitting the D-CPT Law, we can easily predict the general and
downstream performance of arbitrary mixture ratios, model sizes, and dataset
sizes using small-scale training costs on limited experiments. Moreover, we
extend the standard D-CPT Law to cross-domain settings and propose the
Cross-Domain D-CPT Law, which predicts the D-CPT Law of a target domain at a
very small training cost (about 1% of the normal training cost) for that
domain. Comprehensive experimental results on six downstream domains
demonstrate the effectiveness and generalizability of our proposed D-CPT Law
and Cross-Domain D-CPT Law.
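
The idea of fitting a law on a few small runs and then extrapolating can be illustrated with a minimal sketch. The abstract does not state the functional form of the D-CPT Law, so the code below assumes a plain power law, loss = A * size^(-alpha), at one fixed mixture ratio; this is a deliberate simplification and not the paper's parameterization. The function names, the constants, and the noise-free synthetic data points are all hypothetical and serve only to show the fit-then-predict workflow.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = A * size**(-alpha) in log-log space.

    Assumed form for illustration only; not the D-CPT Law itself.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(loss) = log(A) - alpha * log(size), so alpha is the negated slope.
    return math.exp(intercept), -slope


def predict(A, alpha, size):
    """Evaluate the fitted power law at a dataset size not seen in the fit."""
    return A * size ** (-alpha)


# Synthetic, noise-free "small-scale runs" generated from a known law
# (hypothetical values, not measurements from the paper).
TRUE_A, TRUE_ALPHA = 5.0, 0.3
small_sizes = [1e6, 2e6, 4e6, 8e6]
small_losses = [TRUE_A * s ** (-TRUE_ALPHA) for s in small_sizes]

A_hat, alpha_hat = fit_power_law(small_sizes, small_losses)
extrapolated = predict(A_hat, alpha_hat, 1e9)
print(f"A={A_hat:.3f}, alpha={alpha_hat:.3f}, predicted loss at 1e9: {extrapolated:.5f}")
```

A law that also covers mixture ratio and model size would need those quantities as additional inputs and, in practice, a nonlinear fitting routine; the log-log regression here only works because the assumed form has no additive constant.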

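The Domain-specific Continual Pre-Training Scaling Law (D-CPT Law) can be used to predict the optimal mixture ratio for language models of different sizes, and the Cross-Domain D-CPT Law can be used to predict the law for target domains, at relatively low training cost across model sizes and dataset sizes.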

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Large language models (LLMs) are now widely used in various fields, including
finance. However, Japanese financial-specific LLMs have not been proposed yet.
Hence, this study aims to construct a Japanese financial-specific LLM through
continual pre-training. Before tuning, we constructed Japanese
financial-focused datasets for continual pre-training. As a base model, we
employed a Japanese LLM that achieved state-of-the-art performance on Japanese
financial benchmarks among the 10-billion-class parameter models. After
continual pre-training using the datasets and the base model, the tuned model
performed better than the original model on the Japanese financial benchmarks.
Moreover, a comparison of outputs reveals that the tuned model's answers
tend to be better than the original model's in terms of quality and
length. These findings indicate that domain-specific continual
pre-training is also effective for LLMs. The tuned model is publicly available
on Hugging Face.

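This study aims to build a large language model specific to the Japanese financial domain through continual pre-training, and shows that the resulting model outperforms the original model on Japanese financial benchmarks. The findings indicate that domain-specific continual pre-training is also effective for large language models. The tuned model is publicly available on Hugging Face.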

Construction of Domain-specified Japanese Large Language Model for Finance through Continual Pre-training

Large Language Models (LLMs) pre-trained on massive corpora have exhibited
remarkable performance on various NLP tasks. However, applying these models to
specific domains still poses significant challenges, such as lack of domain
knowledge, limited capacity to leverage domain knowledge and inadequate
adaptation to domain-specific data formats. Considering the exorbitant cost of
training LLMs from scratch and the scarcity of annotated data within particular
domains, in this work, we focus on domain-specific continual pre-training of
LLMs, using the E-commerce domain as an exemplar. Specifically, we explore the
impact of continual pre-training on LLMs using unlabeled general and
E-commerce corpora. Furthermore, we design a mixing strategy among different
data sources to better leverage semi-structured E-commerce data. We construct
multiple tasks to assess LLMs' few-shot in-context learning ability and their
zero-shot performance after instruction tuning in the E-commerce domain.
Experimental results demonstrate the effectiveness of continual pre-training of
E-commerce LLMs and the efficacy of our devised data mixing strategy.
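
Mixing general and domain data at a chosen ratio can be sketched as follows. The abstract does not describe the paper's actual mixing strategy, so this is a generic ratio-based sampler and not the authors' method: it draws each training document from the domain corpus with probability `domain_ratio` and from the general corpus otherwise, sampling with replacement. The function name `mix_corpora`, its parameters, and the toy document strings are all hypothetical.

```python
import random


def mix_corpora(general_docs, domain_docs, domain_ratio, n_samples, seed=0):
    """Build a training stream mixing two corpora at a target domain ratio.

    Generic illustration (sampling with replacement); not the paper's strategy.
    """
    if not 0.0 <= domain_ratio <= 1.0:
        raise ValueError("domain_ratio must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so the stream is reproducible
    mixed = []
    for _ in range(n_samples):
        # Pick the source corpus first, then a document from that corpus.
        source = domain_docs if rng.random() < domain_ratio else general_docs
        mixed.append(rng.choice(source))
    return mixed


# Toy corpora standing in for real general and E-commerce text.
general = [f"general-{i}" for i in range(100)]
domain = [f"domain-{i}" for i in range(100)]

stream = mix_corpora(general, domain, domain_ratio=0.3, n_samples=10_000)
domain_share = sum(doc.startswith("domain-") for doc in stream) / len(stream)
print(f"realized domain share: {domain_share:.3f}")
```

Because the source is chosen per document, the realized share only approximates the target ratio; a pipeline that needs exact proportions would instead fix the number of documents taken from each corpus.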

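Large language models (LLMs) pre-trained on massive corpora have shown excellent performance on a variety of NLP tasks. However, applying these models to specific domains still poses significant challenges, such as a lack of domain knowledge, limited ability to leverage domain knowledge, and inadequate adaptation to domain-specific data formats. This study therefore focuses on domain-specific continual pre-training, using the E-commerce domain as an example. Specifically, we explore the impact of continual pre-training on LLMs using unlabeled general and E-commerce corpora. In addition, we design a mixing strategy to make better use of semi-structured E-commerce data. We construct multiple tasks to evaluate LLMs' few-shot in-context learning ability in the E-commerce domain and their zero-shot performance after instruction tuning. Experimental results demonstrate the effectiveness of continual pre-training for E-commerce LLMs and of our data mixing strategy.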