Instruction tuning has been widely adopted to ensure large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities of LLMs rely heavily on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable way to provide LLMs with diverse and high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt these models as response generators for the synthetic instructions. In this paper, we challenge this commonly adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models' Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators because they ignore the compatibility between teachers and the base models being fine-tuned. We thus develop a novel metric, named Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.

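This study examines a common assumption in instruction tuning, namely that larger or stronger models are stronger teachers for smaller models. Through extensive experiments across multiple models and response generators, it finds that this assumption does not hold, and it proposes a novel metric, Compatibility-Adjusted Reward (CAR), that evaluates the effectiveness of response generators more accurately. Experimental results show that CAR outperforms almost all baseline metrics.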

Stronger Models Are Not Stronger Teachers: Rethinking Instruction Tuning