Open-source large language models (LLMs) have become considerably stronger across diverse fields. Nevertheless, most studies concentrate on English, with only limited exploration of multilingual supervised fine-tuning (SFT). In this work, we therefore construct an open-source multilingual SFT dataset. Unlike previous works that simply translate English instructions, we consider both the language-specific and the language-agnostic abilities of LLMs. For language-specific abilities, we introduce a knowledge-grounded data augmentation approach to elicit more culture-specific knowledge from LLMs, improving their ability to serve users from different countries. For language-agnostic abilities, we find through experiments that modern LLMs exhibit strong cross-lingual transfer capabilities, so repeatedly learning identical content in multiple languages is unnecessary. Consequently, we can substantially prune the language-agnostic SFT data without any performance degradation, making the SFT process more efficient. The resulting UltraLink dataset comprises approximately 1 million samples across five languages, and the proposed data construction method can be easily extended to other languages. UltraLink-LM, which is trained on UltraLink, outperforms several representative baselines across many tasks.

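This work constructs an open-source multilingual supervised fine-tuning dataset. A knowledge-grounded data augmentation method elicits more culture-specific knowledge from large language models, improving their ability to serve users from different countries. Experiments show that modern large language models exhibit strong cross-lingual transfer, which allows the language-agnostic fine-tuning data to be pruned substantially and makes the fine-tuning process more efficient. UltraLink-LM, trained on the resulting UltraLink dataset, outperforms other representative baseline models on multiple tasks.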

UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-Tuning Dataset