Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) achieves so by leveraging less training data needed by finetuning or distillation. Our method extracts LLM rationales as additional supervision for small models within a multi-task training framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with much fewer labeled/unlabeled training examples. Second, compared to LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our 770M T5 model outperforms the 540B PaLM model using only 80% of available data on a benchmark task.

本文介绍一种名为“Distilling step-by-step”的新机制，该机制通过在多任务训练框架内提取 LLM rationales 作为小型模型的附加监督来训练比 LLM 更小且表现更好的模型，并且使用远少于 finetuning 或 distillation 所需的标注数据。作者研究表明，相对于 finetuning 和 distillation，本机制使用更少的标注/非标注训练样例实现更好的性能；并且相对于 LLMs，使用明显更小的模型尺寸实现更好的性能；作者使用了 only 80% of available data on a benchmark task，就可以使用 770M T5 模型胜过 540B PaLM。

蒸馏逐步！用更少的训练数据和更小的模型尺寸胜过更大的语言模型