Visual Attention Networks (VAN) with Large Kernel Attention (LKA) modules
have been shown to deliver remarkable performance, surpassing Vision
Transformers (ViTs) on a range of vision-based tasks. However, the depth-wise
convolutional layer in these LKA modules incurs a quadratic increase in the
computational and memory footprints with increasing convolutional kernel size.
To mitigate these problems and to enable the use of extremely large
convolutional kernels in the attention modules of VAN, we propose a family of
Large Separable Kernel Attention modules, termed LSKA. LSKA decomposes the 2D
convolutional kernel of the depth-wise convolutional layer into cascaded
horizontal and vertical 1D kernels. In contrast to the standard LKA design,
the proposed decomposition enables the direct use of the depth-wise
convolutional layer with large kernels in the attention module, without
requiring any extra blocks. We demonstrate that the proposed LSKA module in VAN
achieves performance comparable to the standard LKA module while incurring lower
computational complexity and memory footprint. We also find that the proposed
LSKA design biases the VAN more toward the shape of the object than the texture
with increasing kernel size. Additionally, we benchmark the robustness of
LKA and LSKA in VAN, ViTs, and the recent ConvNeXt on five corrupted
versions of the ImageNet dataset that are largely unexplored in previous
work. Our extensive experimental results show that the proposed LSKA module in
VAN provides a significant reduction in computational complexity and memory
footprint with increasing kernel size, while outperforming ViTs and ConvNeXt
and providing performance similar to the LKA module in VAN on object
recognition, object detection, semantic segmentation, and robustness tests.
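
The decomposition described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the helper names (`conv2d_same`, `outer`, `depthwise_weights_2d`, `depthwise_weights_separable`), the toy image, and the kernel values are all assumptions made for illustration. The sketch shows two things: a cascaded 1 x k and k x 1 pass over one channel matches a single k x k pass whose kernel is the outer product of the two 1D kernels, and the per-layer weight count grows as 2k instead of k^2.

```python
def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation of one channel (odd kernel sizes)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, x = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= x < w:  # outside the image counts as zero
                        acc += image[y][x] * kernel[a][b]
            out[i][j] = acc
    return out


def outer(col, row):
    """k x k kernel formed as the outer product of a vertical and a horizontal 1D kernel."""
    return [[c * r for r in row] for c in col]


def depthwise_weights_2d(channels, k):
    """Weights in a depth-wise k x k convolution (one k x k kernel per channel)."""
    return channels * k * k


def depthwise_weights_separable(channels, k):
    """Weights in a cascaded depth-wise 1 x k and k x 1 pair."""
    return channels * 2 * k


# Toy single-channel input and 1D kernels (hypothetical values).
image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(5)]
k_row = [1, 2, -1]  # horizontal 1 x 3 kernel
k_col = [2, 0, 1]   # vertical 3 x 1 kernel

# Horizontal pass followed by vertical pass.
cascaded = conv2d_same(conv2d_same(image, [k_row]), [[v] for v in k_col])
# Single pass with the equivalent 3 x 3 kernel.
full = conv2d_same(image, outer(k_col, k_row))

print(cascaded == full)
for k in (7, 23, 35):
    print(k, depthwise_weights_2d(64, k), depthwise_weights_separable(64, k))
```

Note that the equivalence only runs in one direction: a cascaded pair can represent a k x k kernel that factors into an outer product, not an arbitrary k x k kernel, so the savings come with a restriction on the kernels that can be expressed.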

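By decomposing the 2D kernel of the depth-wise convolutional layer into cascaded horizontal and vertical 1D kernels, the paper proposes a family of Large Separable Kernel Attention (LSKA) modules that reduce computational complexity and memory footprint while enabling attention modules with large convolutional kernels in Visual Attention Networks (VAN). It shows that the LSKA module has a stronger bias toward object shape, and lower computational complexity and memory footprint, than the standard LKA module in VAN.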

Large Separable Kernel Attention: Rethinking the Large Kernel Attention Design in CNN

Operating deep neural networks on devices with limited resources requires the
reduction of their memory footprints and computational requirements. In this
paper, we introduce a training method called look-up table quantization (LUT-Q),
which learns a dictionary and assigns each weight to one of the dictionary's
values. We show that this method is very flexible and that many other
techniques can be seen as special cases of LUT-Q. For example, we can constrain
the dictionary trained with LUT-Q to generate networks with pruned weight
matrices or restrict the dictionary to powers of two to avoid the need for
multiplications. In order to obtain fully multiplier-less networks, we also
introduce a multiplier-less version of batch normalization. Extensive
experiments on image recognition and object detection tasks show that LUT-Q
consistently achieves better performance than other methods with the same
quantization bitwidth.
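
The dictionary-and-assignment idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the LUT-Q training procedure itself: the abstract does not say how dictionary updates interleave with gradient steps, so the sketch uses a plain k-means style update on a fixed weight list. The function names (`assign`, `update_dictionary`, `nearest_power_of_two`) and all numeric values are hypothetical.

```python
import math


def assign(weights, dictionary):
    """For each weight, return the index of the nearest dictionary value."""
    return [
        min(range(len(dictionary)), key=lambda k: abs(w - dictionary[k]))
        for w in weights
    ]


def update_dictionary(weights, assignments, dictionary):
    """One k-means style step: each entry becomes the mean of its assigned weights."""
    new = list(dictionary)
    for k in range(len(dictionary)):
        members = [w for w, a in zip(weights, assignments) if a == k]
        if members:  # leave unused entries unchanged
            new[k] = sum(members) / len(members)
    return new


def nearest_power_of_two(x):
    """Round x to a signed power of two (rounding the exponent); zero stays zero."""
    if x == 0:
        return 0.0
    return math.copysign(2.0 ** round(math.log2(abs(x))), x)


# Toy weights and a 4-entry dictionary, so each weight is stored as a 2-bit index.
weights = [-0.9, -0.55, -0.1, 0.05, 0.3, 0.45, 1.1]
dictionary = [-1.0, 0.0, 0.5, 1.0]

for _ in range(5):
    idx = assign(weights, dictionary)
    dictionary = update_dictionary(weights, idx, dictionary)

idx = assign(weights, dictionary)
quantized = [dictionary[a] for a in idx]

# Constrained variant: snap the dictionary to powers of two, so that a
# multiplication by a dictionary value can be replaced by a bit shift.
pow2_dictionary = [nearest_power_of_two(d) for d in dictionary]
pow2_quantized = [pow2_dictionary[a] for a in idx]

print(idx)
print(quantized)
print(pow2_quantized)
```

Storing `idx` plus the small dictionary instead of the full-precision weights is where the memory saving comes from. A pruning constraint could presumably be expressed in the same framework by holding one dictionary entry at zero, though that detail is an assumption here rather than something the abstract spells out.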

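This work introduces a training method called LUT-Q, which learns a dictionary and assigns each weight to one of the dictionary's values in order to reduce the memory and computational requirements of deep neural networks. Experimental results show that LUT-Q outperforms other methods at the same quantization bitwidth, and the work also proposes a multiplier-less version of batch normalization.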