Instruction tuning improves the ability of large language models (LLMs) to follow human instructions across diverse tasks, and it relies on high-quality datasets to guide model behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, which limits LLM generalization and the effective use of pre-trained knowledge. We propose Aligning Instruction Tuning with Pre-training (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the benefits of adaptive data selection, controlled rewriting, and balanced integration, underscoring the importance of aligning instruction tuning with pre-training distributions to unlock the full potential of LLMs.

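This study addresses two problems that large language models (LLMs) face during instruction tuning: insufficient dataset coverage and a mismatch with the pre-training distribution. It proposes a new method, Aligning Instruction Tuning with Pre-training (AITP). By rewriting underrepresented data into high-quality instruction-response pairs, the method increases dataset diversity and yields clear performance improvements across eight benchmarks, showing that aligning the two distributions helps unlock the full potential of LLMs.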

Aligning Instruction Tuning with Pre-training