The creation of instruction data and evaluation benchmarks for serving Large language models often involves enormous human annotation. This issue becomes particularly pronounced when rapidly developing such resources for a non-English language like Japanese. Instead of following the popular practice of directly translating existing English resources into Japanese (e.g., Japanese-Alpaca), we propose an efficient self-instruct method based on GPT-4. We first translate a small amount of English instructions into Japanese and post-edit them to obtain native-level quality. GPT-4 then utilizes them as demonstrations to automatically generate Japanese instruction data. We also construct an evaluation benchmark containing 80 questions across 8 categories, using GPT-4 to automatically assess the response quality of LLMs without human references. The empirical results suggest that the models fine-tuned on our GPT-4 self-instruct data significantly outperformed the Japanese-Alpaca across all three base pre-trained models. Our GPT-4 self-instruct data allowed the LLaMA 13B model to defeat GPT-3.5 (Davinci-003) with a 54.37\% win-rate. The human evaluation exhibits the consistency between GPT-4's assessments and human preference. Our high-quality instruction data and evaluation benchmark have been released here.

我们提出了基于GPT-4的高效自指导方法，通过翻译少量英语指令并进行修订，为日语构建高质量的指令数据和评估基准，并证明使用我们的GPT-4自指导数据进行微调的模型在所有三个基本预训练模型上表现明显优于Japanese-Alpaca。

Rapidly Developing High-quality Instruction Data and Evaluation
  Benchmark for Large Language Models with Minimal Human Effort: A Case Study
  on Japanese

最小人工投入快速开发大型语言模型的高质量指导数据和评估基准：以日语为例的案例研究