In this paper, we address the challenge of optimizing training setups for Large Language Models (LLMs) for low-resource languages, where only a limited target language corpus is available. Existing work adopts multi-epoch, multi-lingual, and two-stage training to use the limited target language corpus efficiently. However, the optimal hyperparameter setups for combining these three approaches when training LLMs are still poorly understood. We exhaustively explore training setups for low-resource language LLMs that combine these three approaches, and report the following findings, which reduce the cost of hyperparameter search: (1) As the amount of target language corpus decreases, the optimal training approach shifts from monolingual single-stage training to multi-lingual two-stage training at a threshold that depends on the compute budget. (2) The optimal model scale remains stable regardless of the amount of target language corpus, so the compute-optimal scale of monolingual training can be reused. (3) The optimal number of epochs can be extrapolated from smaller-scale experiments to larger scales using our proposed model. We also provide evidence that, in single-stage training, the target language validation loss follows a power law with respect to the target language ratio, with an exponent that is independent of the amount of data, the model scale, and the language pair.
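
The power-law claim at the end of the abstract can be illustrated with a minimal sketch. It assumes the simplest possible form, loss = a * ratio^(-b), where ratio is the target language ratio; the abstract does not state the exact functional form (for example, whether an irreducible-loss offset is included), so the form, the helper name `fit_power_law`, and the synthetic numbers below are assumptions for illustration only, not results from the paper.

```python
import numpy as np


def fit_power_law(ratios, losses):
    """Fit loss = a * ratio**(-b) by least squares in log-log space.

    Hypothetical helper: the functional form is an assumption, not the
    paper's stated model. Returns the pair (a, b).
    """
    log_r = np.log(np.asarray(ratios, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    # In log space the power law is a line: log(loss) = log(a) - b * log(ratio).
    slope, intercept = np.polyfit(log_r, log_l, deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic, noise-free data generated from known parameters (not paper data).
ratios = np.array([0.05, 0.1, 0.2, 0.4, 0.8])
true_a, true_b = 3.0, 0.1
losses = true_a * ratios ** (-true_b)

a, b = fit_power_law(ratios, losses)
print(f"fitted a={a:.4f}, b={b:.4f}")
```

Under this assumed form, an exponent that is independent of data amount, model scale, and language pair would show up as the same fitted `b` across runs, with only `a` changing.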

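This study addresses the problem of optimizing training configurations for large language models in low-resource language settings. Through an in-depth exploration of how multi-epoch, multi-lingual, and two-stage training can be combined, it proposes strategies that effectively reduce the cost of hyperparameter search. The study finds that as the amount of target language corpus decreases, the optimal training approach shifts from monolingual single-stage training to multi-lingual two-stage training, and that the optimal model scale remains stable across different corpus sizes.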

Optimizing Low-Resource Language Model Training: A Comprehensive Analysis of Multi-Epoch, Multi-Lingual, and Two-Stage Approaches