Recent works show that reducing the number of layers in a convolutional
neural network can enhance efficiency while maintaining the performance of the
network. Existing depth compression methods remove redundant non-linear
activation functions and merge the consecutive convolution layers into a single
layer. However, these methods suffer from a critical drawback: the kernel size
of the merged layers becomes larger, significantly undermining the latency
reduction gained from reducing the depth of the network. We show that this
problem can be addressed by jointly pruning convolution layers and activation
functions. To this end, we propose LayerMerge, a novel depth compression method
that selects which activation layers and convolution layers to remove in order to
achieve a desired inference speed-up while minimizing performance loss. Since
the corresponding selection problem involves an exponential search space, we
formulate a novel surrogate optimization problem and efficiently solve it via
dynamic programming. Empirical results demonstrate that our method consistently
outperforms existing depth compression and layer pruning methods on various
network architectures, on both image classification and generation tasks. Our
code is publicly released.
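
The kernel growth mentioned above can be checked on a small case. The sketch below is not taken from the paper: it uses 1D kernels, stride 1, no padding, and no activation between the two layers, and the helper names `correlate_valid` and `merge_kernels` are hypothetical. Under those assumptions, two consecutive layers with kernel sizes k1 and k2 collapse into one layer with kernel size k1 + k2 - 1. In 2D the same rule applies per axis, so two 3x3 kernels merge into one 5x5 kernel.

```python
def correlate_valid(x, kernel):
    """1D cross-correlation with stride 1 and no padding."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


def merge_kernels(a, b):
    """Kernel of the single layer equivalent to applying `a`, then `b`,
    assuming no activation function sits between the two layers."""
    merged = [0.0] * (len(a) + len(b) - 1)
    for j, a_j in enumerate(a):
        for m, b_m in enumerate(b):
            merged[j + m] += a_j * b_m
    return merged


# Illustrative values only.
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 4.0]
a = [1.0, 0.0, -1.0]  # first layer, kernel size 3
b = [0.5, 2.0, 1.0]   # second layer, kernel size 3

two_layers = correlate_valid(correlate_valid(x, a), b)
merged = merge_kernels(a, b)
one_layer = correlate_valid(x, merged)

print(len(merged))  # kernel size of the merged layer
print(all(abs(p - q) < 1e-9 for p, q in zip(two_layers, one_layer)))
```

The merged layer reproduces the two-layer output with one fewer layer, but its kernel is wider, which is the cost that merging alone leaves unaddressed.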

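Improves the efficiency of convolutional neural networks by jointly pruning convolution layers and activation functions, achieving a desired inference speed-up while minimizing performance loss.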
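
To illustrate why dynamic programming suits a selection problem with an exponential search space, here is a generic budgeted-selection sketch (a 0/1 knapsack). It is not the surrogate problem formulated in the paper: it assumes each layer has an additive importance score and an integer latency cost, both of which are simplifications, and the function name `select_layers` is hypothetical.

```python
def select_layers(importance, latency, budget):
    """Pick layers to keep so that summed importance is maximal while
    total latency stays within `budget`.

    Assumptions (not from the paper): importance is additive across
    layers, and latencies and budget are non-negative integers, e.g.
    discretized time units.
    """
    n = len(importance)
    # best[i][t]: best score using only the first i layers with latency <= t
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for t in range(budget + 1):
            best[i][t] = best[i - 1][t]  # option 1: remove layer i-1
            if latency[i - 1] <= t:      # option 2: keep layer i-1
                candidate = best[i - 1][t - latency[i - 1]] + importance[i - 1]
                if candidate > best[i][t]:
                    best[i][t] = candidate

    # Walk the table backwards to recover which layers were kept.
    kept, t = [], budget
    for i in range(n, 0, -1):
        if best[i][t] != best[i - 1][t]:
            kept.append(i - 1)
            t -= latency[i - 1]
    kept.reverse()
    return best[n][budget], kept


# Illustrative values only.
score, kept = select_layers(
    importance=[3.0, 1.0, 4.0, 2.0],
    latency=[4, 2, 5, 3],
    budget=9,
)
print(score, kept)
```

The table has (number of layers) x (budget + 1) entries, so the cost is polynomial even though there are 2^n possible subsets of layers.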

LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging

The recent surge of large language models (LLMs) highlights their ability to
perform in-context learning, i.e., "learning" to perform a task from a few
demonstrations in the context without any parameter updates. However, their
in-context learning capabilities are limited by the model architecture: 1)
the use of demonstrations is constrained by a maximum sentence length due to
positional embeddings; 2) the quadratic complexity of attention hinders users
from using more demonstrations efficiently; 3) LLMs are shown to be sensitive
to the order of the demonstrations. In this work, we tackle these challenges by
proposing a better architectural design for in-context learning. We propose
SAICL (Structured Attention for In-Context Learning), which replaces
full attention with a structured attention mechanism designed for in-context
learning, and removes unnecessary dependencies between individual
demonstrations, while making the model invariant to the permutation of
demonstrations. We evaluate SAICL in a meta-training framework and show that
SAICL achieves comparable or better performance than full attention while
obtaining up to 3.4x inference speed-up. SAICL also consistently outperforms a
strong Fusion-in-Decoder (FiD) baseline which processes each demonstration
independently. Finally, we demonstrate that, thanks to its linear complexity,
SAICL scales easily to hundreds of demonstrations with continued performance
gains.
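
The idea of removing dependencies between demonstrations can be pictured as a block-structured attention mask. The sketch below shows one simple pattern consistent with the description above, and the actual SAICL mechanism may differ: it assumes every demonstration has the same length, that tokens of a demonstration attend only within that demonstration, and that tokens of the test input attend to everything. The names `build_mask` and `allowed_pairs` are hypothetical.

```python
def build_mask(num_demos, demo_len, query_len):
    """Boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumed layout: `num_demos` demonstrations of `demo_len` tokens each,
    followed by `query_len` tokens of the test input.
    """
    # Segment id per token: 0..num_demos-1 for demonstrations,
    # num_demos for the test input.
    segment = [i // demo_len for i in range(num_demos * demo_len)]
    segment += [num_demos] * query_len
    total = len(segment)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            same_segment = segment[i] == segment[j]
            is_query_row = segment[i] == num_demos
            mask[i][j] = same_segment or is_query_row
    return mask


def allowed_pairs(mask):
    """Number of (query token, key token) pairs the mask permits."""
    return sum(sum(row) for row in mask)


# Illustrative sizes only.
for k in (3, 6, 9):
    total = k * 2 + 2
    print(k, allowed_pairs(build_mask(k, 2, 2)), total * total)
```

With fixed demonstration and test-input lengths, the number of permitted pairs grows by a constant amount per added demonstration, whereas full attention grows with the square of the total length. The mask alone does not make a model permutation invariant; that would also depend on how positions are encoded, which this sketch does not model.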

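Proposes SAICL (Structured Attention for In-Context Learning), a better architectural design for in-context learning that replaces full attention with a structured attention mechanism designed for in-context learning and removes unnecessary dependencies between individual demonstrations, while making the model invariant to the permutation of demonstrations. Evaluated in a meta-training framework, SAICL achieves performance comparable to or better than full attention while obtaining up to 3.4x inference speed-up. SAICL also consistently outperforms a strong Fusion-in-Decoder (FiD) baseline that processes each demonstration independently. Finally, thanks to its linear nature, SAICL scales easily to hundreds of demonstrations with continued performance gains.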