The large number of parameters in pretrained language models improves their
performance but also makes them resource-intensive and therefore hard to
deploy on commodity hardware such as a single GPU. Due to the memory and
power limitations of these devices, model compression techniques are often used
to decrease both the model's size and its inference latency. This usually
results in a trade-off between model accuracy and efficiency. Therefore,
optimizing this balance is essential for effectively deploying LLMs on
commodity hardware. A significant part of the efficiency challenge lies in the
feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$
of the total parameters and inference latency. In this paper, we first observe
that only a few neurons in the FFN module, the so-called heavy hitters, have a
large output norm for any input token, while the others are sparsely triggered
by different tokens. Based on this observation, we explicitly split the FFN
into two parts according to the heavy hitters. We improve the
efficiency-accuracy trade-off of existing compression methods by allocating
more resources to the FFN part that contains the heavy hitters. In practice,
our method can reduce model size by 43.1\% and bring a $1.25\sim1.56\times$
wall-clock speedup on different hardware with a negligible accuracy drop.
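
The heavy-hitter split described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a bias-free FFN with ReLU activation, ranks neurons by their mean output norm over a small batch of tokens, and treats the top `k` as heavy hitters. The names `neuron_norms`, `split_ffn`, `ffn`, and the value of `k` are hypothetical choices made for illustration.

```python
import math
import random


def relu(v):
    return v if v > 0.0 else 0.0


def neuron_norms(tokens, w_in, w_out):
    """Mean output norm of each hidden neuron over a batch of tokens.

    Assumption: neuron i contributes relu(x . w_in[i]) * w_out[i] to the
    FFN output, so its output norm is relu(x . w_in[i]) * ||w_out[i]||.
    """
    norms = []
    for wi, wo in zip(w_in, w_out):
        out_scale = math.sqrt(sum(c * c for c in wo))
        acts = [relu(sum(a * b for a, b in zip(x, wi))) for x in tokens]
        norms.append(out_scale * sum(acts) / len(tokens))
    return norms


def split_ffn(w_in, w_out, norms, k):
    """Partition the neurons into the k heavy hitters and the rest."""
    order = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    heavy, rest = sorted(order[:k]), sorted(order[k:])

    def pick(idx):
        return [w_in[i] for i in idx], [w_out[i] for i in idx]

    return heavy, pick(heavy), pick(rest)


def ffn(x, w_in, w_out):
    """Bias-free ReLU FFN written as a sum of per-neuron contributions."""
    y = [0.0] * len(x)
    for wi, wo in zip(w_in, w_out):
        a = relu(sum(p * q for p, q in zip(x, wi)))
        for j in range(len(y)):
            y[j] += a * wo[j]
    return y


# Toy sizes and random weights, chosen only for the demonstration.
random.seed(0)
d_model, d_hidden, k = 4, 8, 2
w_in = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
tokens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(16)]

norms = neuron_norms(tokens, w_in, w_out)
heavy, heavy_part, rest_part = split_ffn(w_in, w_out, norms, k)

# The two parts partition the neurons, so their outputs sum to the full FFN.
x = tokens[0]
full = ffn(x, w_in, w_out)
parts = [a + b for a, b in zip(ffn(x, *heavy_part), ffn(x, *rest_part))]
assert all(abs(a - b) < 1e-9 for a, b in zip(full, parts))
```

Because the split is an exact partition of the hidden neurons, the two parts can in principle be compressed at different rates, for example by keeping the heavy-hitter part closer to full precision or rank, without changing how their outputs are combined.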

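This work optimizes the deployment of pretrained language models (PLMs) on commodity hardware. It improves efficiency through model compression, splits the feed-forward network into two parts to make existing compression methods more effective, and achieves a substantial reduction in model size together with faster inference.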

FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference

Recently, the efficient deployment and acceleration of powerful vision
transformers (ViTs) on resource-limited edge devices for providing multimedia
services have become attractive tasks. Although early exiting is a feasible
solution for accelerating inference, most works focus on convolutional neural
networks (CNNs) and transformer models in natural language processing
(NLP). Moreover, the direct application of early exiting methods to ViTs may
result in substantial performance degradation. To tackle this challenge, we
systematically investigate the efficacy of early exiting in ViTs and point out
that the insufficient feature representations in shallow internal classifiers
and the limited ability to capture target semantic information in deep internal
classifiers restrict the performance of these methods. We then propose an early
exiting framework for general ViTs termed LGViT, which incorporates
heterogeneous exiting heads, namely, local perception head and global
aggregation head, to achieve an efficiency-accuracy trade-off. In particular,
we develop a novel two-stage training scheme, including end-to-end training and
self-distillation with the backbone frozen to generate early exiting ViTs,
which facilitates the fusion of global and local information extracted by the
two types of heads. We conduct extensive experiments using three popular ViT
backbones on three vision datasets. Results demonstrate that our LGViT can
achieve competitive performance with an approximately $1.8\times$ speed-up.
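
The general idea of early exiting can be sketched as follows. The abstract does not state the exit criterion, so this sketch assumes a common one: stop at the first internal classifier whose maximum softmax probability reaches a threshold. The function `early_exit_forward`, the threshold value, and the toy blocks and heads are all hypothetical stand-ins, not LGViT's local perception or global aggregation heads.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_forward(x, blocks, exit_heads, final_head, threshold=0.9):
    """Run blocks in order and stop at the first confident exit head.

    exit_heads maps a block index to an internal classifier (assumption:
    confidence is the maximum softmax probability). Returns a tuple
    (predicted_class, number_of_blocks_executed).
    """
    h = x
    for i, block in enumerate(blocks):
        h = block(h)
        head = exit_heads.get(i)
        if head is not None:
            probs = softmax(head(h))
            if max(probs) >= threshold:
                return probs.index(max(probs)), i + 1
    probs = softmax(final_head(h))
    return probs.index(max(probs)), len(blocks)


# Toy stand-ins: each "block" doubles the features, each "head" reads the
# features directly as logits. Real blocks would be transformer layers.
def double(h):
    return [2.0 * v for v in h]


def identity(h):
    return list(h)


blocks = [double] * 4
exit_heads = {0: identity, 1: identity}  # internal classifiers after blocks 1 and 2

easy = early_exit_forward([3.0, 0.0], blocks, exit_heads, identity)
medium = early_exit_forward([1.0, 0.0], blocks, exit_heads, identity)
hard = early_exit_forward([0.1, 0.0], blocks, exit_heads, identity)
```

In this toy setup the second element of each result counts the blocks actually executed: inputs that produce confident predictions early skip the remaining blocks, which is where the inference speed-up of early exiting comes from.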

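We propose LGViT, an early exiting framework for general ViTs. It introduces heterogeneous exiting heads, namely a local perception head and a global aggregation head, to achieve an efficiency-accuracy trade-off. A two-stage training scheme, consisting of end-to-end training followed by self-distillation with the backbone frozen, produces the early exiting ViTs and further facilitates the fusion of the global and local information extracted by the two types of heads. Experiments show that LGViT maintains competitive performance with an approximately 1.8x speed-up.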

LGViT: Dynamic Early Exiting for Accelerating Vision Transformer

Temporal action detection (TAD) is an important yet challenging task in video
understanding. It aims to simultaneously predict the semantic label and the
temporal interval of every action instance in an untrimmed video. Rather than
end-to-end learning, most existing methods adopt a head-only learning paradigm,
where the video encoder is pre-trained for action classification, and only the
detection head on top of the encoder is optimized for TAD. The effect of
end-to-end learning has not been systematically evaluated. Moreover, an in-depth
study of the efficiency-accuracy trade-off in end-to-end TAD is lacking. In this paper, we
present an empirical study of end-to-end temporal action detection. We validate
the advantage of end-to-end learning over head-only learning and observe up to
11\% performance improvement. Besides, we study the effects of multiple design
choices that affect the TAD performance and speed, including detection head,
video encoder, and resolution of input videos. Based on the findings, we build
a mid-resolution baseline detector, which achieves the state-of-the-art
performance of end-to-end methods while running more than 4$\times$ faster. We
hope that this paper can serve as a guide for end-to-end learning and inspire
future research in this field. Code and models are available at
https://github.com/xlliu7/E2E-TAD.

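This paper studies temporal action detection with end-to-end learning. Compared with methods that optimize only the detection head, end-to-end learning brings up to an 11% performance improvement. The paper also examines in depth several design choices that affect TAD performance and speed, and proposes a more efficient detector.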