This paper introduces channel gating, a dynamic, fine-grained, and
hardware-efficient pruning scheme to reduce the computation cost for
convolutional neural networks (CNNs). Channel gating identifies regions in the
features that contribute less to the classification result, and skips the
computation on a subset of the input channels for these ineffective regions.
Unlike static network pruning, channel gating optimizes CNN inference at
run-time by exploiting input-specific characteristics, which substantially
reduces the compute cost with almost no accuracy loss. We
experimentally show that applying channel gating in state-of-the-art networks
achieves 2.7-8.0$\times$ reduction in floating-point operations (FLOPs) and
2.0-4.4$\times$ reduction in off-chip memory accesses with a minimal accuracy
loss on CIFAR-10. Combining our method with knowledge distillation reduces the
compute cost of ResNet-18 by 2.6$\times$ without an accuracy drop on ImageNet. We
further demonstrate that channel gating can be realized in hardware
efficiently. Our approach exhibits sparsity patterns that are well-suited to
dense systolic arrays with minimal additional hardware. We have designed an
accelerator for channel gating networks, which can be implemented using either
FPGAs or ASICs. Running a quantized ResNet-18 model for ImageNet, our
accelerator achieves an encouraging speedup of 2.4$\times$ on average, with a
theoretical FLOP reduction of 2.8$\times$.
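
As a rough illustration of the gating idea described in the abstract, the sketch below computes one output value per spatial position from a subset of the input channels and computes the remaining channels only when a gate decides the position is effective. The abstract does not state the decision rule, so the threshold test on the partial sum, the parameter names `base_channels` and `threshold`, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
def gated_dot(x, w, base_channels, threshold):
    """Compute one output value at one spatial position with channel gating.

    x, w: per-input-channel activations and weights (equal-length lists).
    base_channels: number of leading channels that are always computed (assumed).
    threshold: gate threshold applied to the partial sum (assumed rule).
    Returns (output, multiply_accumulates_used).
    """
    partial = sum(xi * wi for xi, wi in zip(x[:base_channels], w[:base_channels]))
    macs = base_channels
    if partial < threshold:
        # Gate closed: treat the position as ineffective and skip the rest.
        return partial, macs
    # Gate open: finish the dot product over the remaining input channels.
    rest = sum(xi * wi for xi, wi in zip(x[base_channels:], w[base_channels:]))
    return partial + rest, macs + (len(x) - base_channels)


def gated_layer(positions, w, base_channels, threshold):
    """Apply gated_dot to every position; return outputs and the MAC reduction."""
    outputs, used = [], 0
    for x in positions:
        y, macs = gated_dot(x, w, base_channels, threshold)
        outputs.append(y)
        used += macs
    dense = len(positions) * len(w)  # cost without any gating
    return outputs, dense / used


# Toy example: four spatial positions, four input channels each.
positions = [[0, 0, 1, 0], [2, 2, 1, 1], [0, 1, 0, 0], [3, 0, 1, 0]]
weights = [1, 1, 1, 1]
outputs, reduction = gated_layer(positions, weights, base_channels=2, threshold=2)
print(outputs, reduction)
```

Because the gate is evaluated per position on the actual input, the amount of skipped work varies from input to input, which is the difference from static pruning that the abstract emphasizes.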

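This work introduces channel gating, a dynamic, fine-grained, and hardware-efficient pruning scheme that optimizes convolutional neural networks by skipping computation on input channels that contribute little to the classification result. Experiments show that, with almost no loss of accuracy, the method reduces floating-point operations by 2.7-8.0$\times$ and off-chip memory accesses by 2.0-4.4$\times$, and that combining it with knowledge distillation lowers the compute cost further. We also designed an accelerator that runs inference on a quantized ResNet-18 model with a 2.4$\times$ speedup, against a theoretical FLOP reduction of 2.8$\times$.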

Channel Gating Neural Networks

Generative Adversarial Networks (GANs) are one of the most recent deep
learning models that generate synthetic data from limited genuine datasets.
GANs are on the frontier because further extending deep learning into many
domains (e.g., medicine, robotics, content synthesis) requires massive sets of
labeled data that are generally either unavailable or prohibitively costly to
collect. Although GANs are gaining prominence in various fields, there are no
accelerators for these new models. In fact, GANs leverage a new operator,
called transposed convolution, that exposes unique challenges for hardware
acceleration. This operator first inserts zeros within the multidimensional
input, then convolves a kernel over this expanded array to add information to
the embedded zeros. Even though there is a convolution stage in this operator,
the inserted zeros lead to underutilization of the compute resources when a
conventional convolution accelerator is employed. We propose the GANAX
architecture to alleviate the sources of inefficiency associated with the
acceleration of GANs using conventional convolution accelerators, making the
first GAN accelerator design possible. We propose a reorganization of the
output computations to allocate compute rows with similar patterns of zeros to
adjacent processing engines, which also avoids inconsequential multiply-adds on
the zeros. This compulsory adjacency reclaims data reuse across these
neighboring processing engines, which had otherwise diminished due to the
inserted zeros. The reordering breaks the full SIMD execution model, which is
prominent in convolution accelerators. Therefore, we propose a unified
MIMD-SIMD design for GANAX that leverages repeated patterns in the computation
to create distinct microprograms that execute concurrently in SIMD mode.
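
The transposed convolution operator described above can be sketched in one dimension to show where the wasted work comes from. The code below inserts zeros, slides a kernel densely over the expanded array, and records which kernel taps of each output touch real inputs. The function names and the grouping of outputs by zero pattern are hypothetical and only illustrate the idea. They are not the GANAX scheduling algorithm, and border padding is counted as zeros alongside the inserted ones.

```python
from collections import defaultdict


def transposed_conv1d(x, kernel, stride):
    """1-D transposed convolution as zero insertion plus a dense sliding window.

    Returns (outputs, patterns); patterns[j] marks, per kernel tap, whether
    output j reads a real input (True) or an inserted/padding zero (False).
    """
    k = len(kernel)
    # Step 1: insert (stride - 1) zeros between neighbouring inputs.
    expanded, real = [], []
    for i, v in enumerate(x):
        if i > 0:
            expanded += [0] * (stride - 1)
            real += [False] * (stride - 1)
        expanded.append(v)
        real.append(True)
    # Pad the borders so every kernel tap has an operand.
    pad = k - 1
    expanded = [0] * pad + expanded + [0] * pad
    real = [False] * pad + real + [False] * pad
    # Step 2: slide the flipped kernel over the expanded array.
    outputs, patterns = [], []
    for j in range(len(expanded) - k + 1):
        outputs.append(sum(expanded[j + t] * kernel[k - 1 - t] for t in range(k)))
        patterns.append(tuple(real[j + t] for t in range(k)))
    return outputs, patterns


def useful_fraction(patterns):
    """Fraction of multiply-adds whose operand is a real input, not a zero."""
    total = sum(len(p) for p in patterns)
    useful = sum(sum(p) for p in patterns)
    return useful / total


def group_by_zero_pattern(patterns):
    """Group output indices that share the same pattern of zero operands."""
    groups = defaultdict(list)
    for j, p in enumerate(patterns):
        groups[p].append(j)
    return dict(groups)


outputs, patterns = transposed_conv1d([1, 2, 3], [1, 10, 100], stride=2)
print(outputs)
print(useful_fraction(patterns))
print(group_by_zero_pattern(patterns))
```

In this toy case most multiply-adds of a dense window land on zeros, and outputs that share a zero pattern recur at regular intervals. That regularity is what makes it plausible to assign outputs with similar patterns to adjacent processing engines, as the abstract proposes.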

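This paper proposes GANAX, a new accelerator design that addresses the inefficiency of running the transposed convolutions used by generative adversarial networks on conventional convolution accelerators. By reorganizing the output computations and adopting a unified MIMD-SIMD design, it accelerates GAN execution.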

GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks

The rapid growth of data size and accessibility in recent years has
instigated a shift of philosophy in algorithm design for artificial
intelligence. Instead of engineering algorithms by hand, the ability to learn
composable systems automatically from massive amounts of data has led to
ground-breaking performance in important domains such as computer vision,
speech recognition, and natural language processing. The most popular class of
techniques used in these domains is called deep learning, and is seeing
significant attention from industry. However, these models require incredible
amounts of data and compute power to train, and are limited by the need for
better hardware acceleration to accommodate scaling beyond current data and
model sizes. While the current solution has been to use clusters of graphics
processing units (GPUs) as general-purpose processors (GPGPUs), field-programmable
gate arrays (FPGAs) provide an interesting alternative. Current
trends in design tools for FPGAs have made them more compatible with the
high-level software practices typically used in the deep learning
community, making FPGAs more accessible to those who build and deploy models.
Since FPGA architectures are flexible, this could also allow researchers to
explore model-level optimizations beyond what is possible on fixed
architectures such as GPUs. In addition, FPGAs tend to provide high performance per
watt of power consumption, which is of particular importance for application
scientists interested in large scale server-based deployment or
resource-limited embedded applications. This review takes a look at deep
learning and FPGAs from a hardware acceleration perspective, identifying trends
and innovations that make these technologies a natural fit, and motivates a
discussion on how FPGAs may best serve the needs of the deep learning community
moving forward.

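This review examines deep learning and field-programmable gate arrays (FPGAs) from a hardware acceleration perspective, covering the trends and innovations in both, and discusses how FPGAs can best serve the needs of the deep learning community.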