Chain-of-thought distillation is a powerful technique for transferring
reasoning abilities from large language models (LLMs) to smaller student
models. Previous methods typically require the student to mimic the
step-by-step rationale produced by LLMs, often facing the following challenges:
(i) Tokens within a rationale vary in significance, and treating them equally
may fail to accurately mimic keypoint tokens, leading to reasoning errors. (ii)
They usually distill knowledge by consistently predicting all the steps in a
rationale, which falls short in distinguishing the learning order of step
generation. This diverges from the human cognitive progression of starting with
easy tasks and advancing to harder ones, resulting in sub-optimal outcomes. To
address these issues, we propose a unified framework called KPOD.
Specifically, we propose a token weighting module utilizing mask learning to
encourage accurate mimicry of keypoint tokens by the student during
distillation. In addition, we develop an in-rationale progressive distillation
strategy, starting with training the student to generate the final reasoning
steps and gradually extending to cover the entire rationale. To accomplish
this, a weighted token generation loss is proposed to assess step reasoning
difficulty, and a value function is devised to schedule the progressive
distillation by considering both step difficulty and question diversity.
Extensive experiments on four reasoning benchmarks show that KPOD
outperforms previous methods by a large margin.
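
The two ideas in this abstract, weighting tokens by importance and training on the final steps first, can be illustrated with a minimal sketch. This is not the KPOD implementation. The function names are hypothetical, the per-token weights are assumed to be given (the abstract obtains them from a mask-learning module), and a simple linear schedule stands in for the value function described in the abstract, which also accounts for step difficulty and question diversity.

```python
import math


def weighted_token_nll(token_probs, weights):
    """Weighted average negative log-likelihood over rationale tokens.

    token_probs: probability the student assigns to each reference token (> 0).
    weights: per-token importance (assumed given; hypothetical stand-in for
             the output of a token weighting module).
    """
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    loss = sum(-w * math.log(p) for p, w in zip(token_probs, weights))
    return loss / total_weight


def steps_to_train(rationale_steps, stage, num_stages):
    """Return the suffix of steps the student generates at a given stage.

    Stage 1 covers only the last step(s); the final stage covers the whole
    rationale. The linear growth rule is an assumption for illustration.
    """
    n = len(rationale_steps)
    # Integer ceiling of n * stage / num_stages, with at least one step.
    k = max(1, (n * stage + num_stages - 1) // num_stages)
    return rationale_steps[n - k:]


if __name__ == "__main__":
    steps = ["s1", "s2", "s3", "s4"]
    for stage in range(1, 5):
        print(stage, steps_to_train(steps, stage, 4))
    # A keypoint token (weight 1.0) dominates a filler token (weight 0.1).
    print(weighted_token_nll([0.5, 0.9], [1.0, 0.1]))
```

In this sketch, earlier steps that are not yet in the training suffix would be supplied to the student as input, so that it only has to generate the remaining steps.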

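The KPOD framework uses mask learning to encourage the student to accurately mimic keypoint tokens, and a progressive distillation strategy that gradually extends to the entire rationale. It transfers reasoning abilities from large language models to smaller student models and, in extensive experiments, outperforms previous methods by a large margin.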

Keypoint-based Progressive Chain-of-Thought Distillation for LLMs

Knowledge distillation, the technique of transferring knowledge from large,
complex models to smaller ones, marks a pivotal step towards efficient AI
deployment. Distilling Step-by-Step (DSS), a novel method utilizing
chain-of-thought (CoT) distillation, has demonstrated promise by imbuing
smaller models with the superior reasoning capabilities of their larger
counterparts. In DSS, the distilled model acquires the ability to generate
rationales and predict labels concurrently through a multi-task learning
framework. However, DSS overlooks the intrinsic relationship between the two
training tasks, leading to ineffective integration of CoT knowledge with the
task of label prediction. To address this, we investigate the relationship
between the two tasks from an Information Bottleneck perspective and formulate it as
maximizing the mutual information of the representation features of the two
tasks. We propose a variational approach to solve this optimization problem
using a learning-based method. Our experimental results across four datasets
demonstrate that our method outperforms the state-of-the-art DSS. Our findings
offer insightful guidance for future research on language model distillation as
well as applications involving CoT. Code and models will be released soon.
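
As a worked illustration of the objective described above, the standard variational lower bound on mutual information can be written as follows. The symbols are assumptions introduced here, not the paper's notation: Z_r denotes the representation used for rationale generation, Z_p the representation used for label prediction, and q_θ a learned approximation to the true conditional distribution. The paper's exact formulation may differ.

```latex
% Z_r, Z_p and q_\theta are illustrative symbols, not taken from the paper.
I(Z_r; Z_p) = H(Z_r) - H(Z_r \mid Z_p)
\;\ge\; H(Z_r) + \mathbb{E}_{p(z_r, z_p)}\left[ \log q_\theta(z_r \mid z_p) \right]
```

The inequality follows from the non-negativity of the KL divergence between p(z_r | z_p) and q_θ(z_r | z_p), and it becomes tight when q_θ matches the true conditional. Maximizing the right-hand side with respect to θ is what makes such an objective trainable with a learning-based method.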

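This work applies chain-of-thought distillation within a multi-task learning framework and maximizes the mutual information between the representation features of the two training tasks. It proposes a variational approach to better integrate the small model's reasoning ability with label prediction, outperforms the state-of-the-art DSS on four datasets, and offers guidance for future research on language model distillation and applications involving CoT.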

Learning to Maximize Mutual Information for Chain-of-Thought Distillation

Large language models (LLMs) have achieved remarkable advancements in the
field of natural language processing. However, the sheer scale and
computational demands of these models present formidable challenges when
considering their practical deployment in resource-constrained contexts. While
techniques such as chain-of-thought (CoT) distillation have displayed promise
in distilling LLMs into small language models (SLMs), there is a risk that
distilled SLMs may still carry over flawed reasoning or hallucinations
inherited from their LLM counterparts. To address these issues, we propose a
twofold methodology: First, we introduce a novel method for distilling the
self-evaluation capability inherent in LLMs into SLMs, which aims to mitigate
the adverse effects of erroneous reasoning and reduce hallucinations. Second,
we advocate for a comprehensive distillation process that incorporates multiple
distinct chain-of-thought and self-evaluation paradigms and ensures a more
holistic and robust knowledge transfer into SLMs. Experiments on three NLP
benchmarks demonstrate that our method significantly improves the performance
of distilled SLMs and sheds light on the path towards developing smaller models
closely aligned with human cognition.
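
One way to picture the twofold methodology above is as a data-construction step that turns teacher outputs into two kinds of training pairs for the student. The sketch below is illustrative only: the `TeacherSample` record, the `[cot]` and `[eval]` task prefixes, and the function name are all hypothetical, and the abstract does not specify how the training data is actually formatted.

```python
from dataclasses import dataclass


@dataclass
class TeacherSample:
    question: str
    rationale: str   # one chain-of-thought produced by the teacher LLM
    evaluation: str  # the teacher's self-evaluation of that rationale


def build_training_pairs(samples):
    """Turn teacher samples into (input, target) pairs for the student.

    Each sample yields one pair for rationale generation and one pair for
    self-evaluation. Passing several samples for the same question gives
    the student multiple distinct rationales and evaluations to learn from.
    """
    pairs = []
    for s in samples:
        # Task 1: generate a rationale for the question.
        pairs.append((f"[cot] {s.question}", s.rationale))
        # Task 2: evaluate a given rationale for the question.
        pairs.append((f"[eval] {s.question}\n{s.rationale}", s.evaluation))
    return pairs


if __name__ == "__main__":
    samples = [
        TeacherSample("What is 2 + 3?", "2 plus 3 equals 5.", "The reasoning is correct."),
        TeacherSample("What is 2 + 3?", "2 plus 3 equals 6.", "The addition is wrong; 2 + 3 = 5."),
    ]
    for source, target in build_training_pairs(samples):
        print(repr(source), "->", repr(target))
```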

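Large language models (LLMs) have made remarkable progress in natural language processing, but their scale and computational demands make practical deployment in resource-constrained environments highly challenging. To address these issues, we propose a twofold approach. First, we introduce a new method for distilling the self-evaluation capability inherent in LLMs into SLMs, which aims to reduce the adverse effects of erroneous reasoning and hallucinations. Second, we recommend a comprehensive distillation process that combines multiple distinct chain-of-thought and self-evaluation paradigms, ensuring a more complete and robust transfer of knowledge into SLMs. Experiments on three NLP benchmarks show that our method significantly improves the performance of distilled SLMs and points the way towards smaller models that are more closely aligned with human cognition.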