Instruction tuning is widely recognized as a key technique for building generalist language models, and it came to the attention of researchers and the public with the release of InstructGPT \cite{ouyang2022training} and ChatGPT [ https://chat.openai.com/ ]. Despite impressive progress in English-oriented large language models (\textbf{LLMs}), two questions remain under-explored: whether English-based foundation LLMs, given well-designed instruction tuning, can perform as well on multilingual tasks as they do on English tasks, and how the corpora needed for such tuning can be constructed. To address this gap, we propose this project as an attempt to create a Chinese instruction dataset, using methods adapted to the intrinsic characteristics of 4 sub-tasks. We collect around 200k Chinese instruction tuning samples, which have been manually checked to ensure high quality. We also summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora.

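Using a variety of methods adapted to the intrinsic characteristics of 4 sub-tasks, we propose a project to create a Chinese instruction dataset, collect around 200k Chinese instruction tuning samples, and summarize the existing English and Chinese instruction corpora as well as the potential applications of the newly constructed Chinese instruction corpora.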

Chinese Open Instruction Generalist: A Preliminary Release