In this paper, we propose a novel knowledge transfer framework that
introduces continuous normalizing flows for progressive knowledge
transformation and leverages multi-step sampling strategies to achieve
precise knowledge transfer. We name this framework Knowledge Transfer with
Flow Matching (FM-KT), which can be integrated with a metric-based distillation
method of any form (e.g., vanilla KD, DKD, PKD and DIST) and a
meta-encoder of any available architecture (e.g., CNN, MLP and
Transformer). By introducing stochastic interpolants, FM-KT is readily amenable
to arbitrary noise schedules (e.g., VP-ODE, VE-ODE, Rectified flow)
for normalized flow path estimation. We theoretically demonstrate that the
training objective of FM-KT is equivalent to minimizing an upper bound on the
negative log-likelihood of the teacher feature map or logits. Moreover, FM-KT can be
viewed as a unique implicit ensemble method that leads to performance gains. With
slight modifications, FM-KT can also be transformed into an
online distillation framework, OFM-KT, with desirable performance gains. Through
extensive experiments on the CIFAR-100, ImageNet-1k, and MS-COCO datasets, we
empirically validate the scalability and state-of-the-art performance of our
proposed methods relative to comparable approaches.
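
To make the multi-step sampling idea concrete, the following is a minimal sketch of a rectified-flow (straight-line) path and fixed-step Euler integration, assuming NumPy. It is not the FM-KT implementation: the function names are hypothetical, the direction of transport (student feature to teacher feature) is an illustrative assumption, and an oracle velocity field stands in for the learned meta-encoder.

```python
import numpy as np

def interpolate(x_src, x_tgt, t):
    """Straight-line (rectified-flow) interpolant between source and target at time t."""
    return (1.0 - t) * x_src + t * x_tgt

def velocity_target(x_src, x_tgt):
    """Regression target for the velocity field along the straight-line path."""
    return x_tgt - x_src

def euler_sample(velocity_fn, x_src, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x_src, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy data: random arrays standing in for student and teacher features (assumption).
rng = np.random.default_rng(0)
student_feat = rng.normal(size=(4, 8))
teacher_feat = rng.normal(size=(4, 8))

# Oracle velocity field (assumption): a trained model would predict this instead.
def oracle_velocity(x, t):
    return velocity_target(student_feat, teacher_feat)

transported = euler_sample(oracle_velocity, student_feat, num_steps=4)
print(np.allclose(transported, teacher_feat))
```

With a learned, imperfect velocity field, increasing `num_steps` trades extra compute for a finer approximation of the transport path, which is the role multi-step sampling plays in the abstract above.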

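We propose a novel knowledge transfer framework that introduces continuous normalizing flows for progressive knowledge transformation and leverages multi-step sampling strategies to achieve precise knowledge transfer. By introducing stochastic interpolants, we theoretically demonstrate that the training objective of FM-KT is equivalent to minimizing an upper bound on the negative log-likelihood of the teacher feature map or logits. In addition, FM-KT can be viewed as a unique implicit ensemble method, which yields performance gains. Extensive experiments on the CIFAR-100, ImageNet-1k, and MS-COCO datasets demonstrate the scalability and state-of-the-art performance of our proposed methods among comparable approaches.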

Precise Knowledge Transfer via Flow Matching

Learning from massive speech corpora has driven the recent success of
many self-supervised speech models. With knowledge distillation, these models
may also benefit from the knowledge encoded by language models that are
pre-trained on rich sources of texts. The distillation process, however, is
challenging due to the modal disparity between textual and speech embedding
spaces. This paper studies metric-based distillation to align the embedding
space of text and speech with only a small amount of data without modifying the
model structure. Since the semantic and granularity gap between text and speech
has been overlooked in the literature, which impairs distillation, we propose the
Prior-informed Adaptive knowledge Distillation (PAD) that adaptively leverages
text/speech units of variable granularity and prior distributions to achieve
better global and local alignments between text and speech pre-trained models.
We evaluate on three spoken language understanding benchmarks and show that PAD
is more effective in transferring linguistic knowledge than other metric-based
distillation approaches.
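
As a rough illustration of what a metric-based, utterance-level (global) alignment objective can look like, here is a minimal NumPy sketch. It is not the PAD method: the function names are hypothetical, and mean pooling, cosine distance, and equal embedding dimensions for the text and speech models are all simplifying assumptions.

```python
import numpy as np

def mean_pool(units):
    """Average a (num_units, dim) array of frame or token embeddings into one vector."""
    return units.mean(axis=0)

def cosine_distance(u, v, eps=1e-8):
    """Return 1 minus the cosine similarity of two vectors; eps guards against zero norms."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def global_alignment_loss(speech_frames, text_tokens):
    """Distance between pooled speech and pooled text embeddings (assumes equal dim)."""
    return cosine_distance(mean_pool(speech_frames), mean_pool(text_tokens))

# Toy inputs: 50 speech frames and 7 text tokens, both 16-dimensional (assumption).
rng = np.random.default_rng(0)
speech = rng.normal(size=(50, 16))
text = rng.normal(size=(7, 16))
print(global_alignment_loss(speech, text))
```

A local alignment term would instead compare units at a finer granularity, which is where the variable-granularity units and prior distributions mentioned in the abstract would come into play.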

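This paper presents a metric-based knowledge distillation approach to improve the alignment of text and audio embeddings. We propose the Prior-informed Adaptive knowledge Distillation (PAD) method, which transfers knowledge more effectively between text and speech models, and evaluate it on three spoken language understanding benchmarks.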