Pre-trained Transformer models like T5 and BART have advanced the state of
the art on a wide range of text generation tasks. Compressing these models into
smaller ones has become critically important for practical use. Common neural
network compression techniques such as knowledge distillation or quantization
are limited to static compression where the compression ratio is fixed. In this
paper, we introduce Modular Transformers, a modularized encoder-decoder
framework for flexible sequence-to-sequence model compression. Modular
Transformers train modularized layers that have the same function of two or
more consecutive layers in the original model via module replacing and
knowledge distillation. After training, the modularized layers can be flexibly
assembled into sequence-to-sequence models that meet different
performance-efficiency trade-offs. Experimental results show that after a
single training phase, by simply varying the assembling strategy, Modular
Transformers can achieve flexible compression ratios from 1.1x to 6x with
little to moderate relative performance drop.

本文提出了 Modular Transformers 框架，用于灵活的序列到序列模型压缩，通过模块化编码器 - 解码器并进行知识蒸馏，可以实现灵活的压缩比率从 1.1x 到 6x，并且在保持相对性能不变的情况下，可以根据需要灵活组装模块化层。

模块化 Transformer：将 Transformer 压缩为模块化层以进行灵活高效的推理

Modular Transformers: Compressing Transformers into Modularized Layers  for Flexible Efficient Inference

Transformer-based end-to-end speech recognition has achieved great success.
However, the large footprint and computational overhead make it difficult to
deploy these models in some real-world applications. Model compression
techniques can reduce the model size and speed up inference, but the compressed
model has a fixed architecture which might be suboptimal. We propose a novel
Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong
performance-efficiency trade-offs. With a similar number of layers at inference
time, I3D-based models outperform the vanilla Transformer and the static pruned
model via iterative layer pruning. We also present interesting analysis on the
gate probabilities and the input-dependency, which helps us better understand
deep encoders.

该研究提出了一种新的 Transformer 编码器模型，并利用输入依赖动态深度 (I3D) 实现了性能 - 效率的良好均衡，该方法可用于压缩模型大小并通过迭代层剪枝处理以提高模型性能，同时对门控概率和输入依赖性进行了分析以更好地理解深度编码器。

I3D：带有输入依赖的动态深度的 Transformer 架构用于语音识别

I3D: Transformer architectures with input-dependent dynamic depth for speech recognition

This paper is a study of performance-efficiency trade-offs in pre-trained
models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and
formalize several architecture designs that influence both the model
performance and its efficiency. Putting together all our observations, we
introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model
architecture with significant improvements along both performance and
efficiency dimensions across a variety of training setups. For example, under
the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x
inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in
word error rate. With a similar inference time, SEW reduces word error rate by
25-50% across different model sizes.

对预训练模型在自动语音识别中的性能和效率进行了研究，提出了一种新的模型架构 SEW，其在不同训练环境下都取得了良好的性能和效率。