Diffusion models have shown tremendous results in image generation. However,
due to the iterative nature of the diffusion process and its reliance on
classifier-free guidance, inference is slow. In this paper, we propose a
new distillation approach for guided diffusion models in which an external
lightweight guide model is trained while the original text-to-image model
remains frozen. We show that our method reduces the inference computation of
classifier-free guided latent-space diffusion models by almost half, and
requires trainable parameters amounting to only 1\% of the base model's
parameter count. Furthermore, once trained, our guide model can be applied to
various fine-tuned, domain-specific versions of the base diffusion model
without additional training: this "plug-and-play" functionality drastically
reduces inference computation while maintaining the visual fidelity of
generated images. Empirically, we show that
our approach is able to produce visually appealing results and achieve a
comparable FID score to the teacher with as few as 8 to 16 steps.
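
The "almost half" figure follows from how classifier-free guidance is usually computed: each sampling step evaluates the denoiser twice (once with the text condition, once without) and combines the two predictions. The sketch below illustrates that accounting only. It is not the paper's guide model; the function names are hypothetical, and it assumes a distilled student needs one pass of the base model per step plus a comparatively cheap guide.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: start from the unconditional
    prediction and extrapolate toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def base_model_calls(num_steps, passes_per_step):
    """Number of base-model forward passes for a full sampling run."""
    return num_steps * passes_per_step


# Toy two-element "noise predictions" to show the combination rule.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)

# Teacher: two passes per step. Student (assumed): one pass per step.
teacher_calls = base_model_calls(num_steps=16, passes_per_step=2)
student_calls = base_model_calls(num_steps=16, passes_per_step=1)
print(guided, teacher_calls, student_calls)
```

Under these assumptions the base-model cost halves exactly; the small extra cost of the guide is what makes the overall saving "almost" half.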

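We propose a new distillation approach that reduces the inference time of iterative, classifier-free guided diffusion models, requires trainable parameters amounting to only 1% of the base model, and maintains the visual fidelity of generated images.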

Plug-and-Play Diffusion Distillation

Large language models (LLMs) can solve challenging tasks. However, their
inference computation on modern GPUs is highly inefficient due to the
increasing number of tokens they must attend to as they generate new ones. To
address this inefficiency, we capitalize on LLMs' problem-solving capabilities
to optimize their own inference-time efficiency. We demonstrate with two
specific tasks: (a) evaluating complex arithmetic expressions and (b)
summarizing news articles. For both tasks, we create custom datasets to
fine-tune an LLM. The goal of fine-tuning is twofold: first, to make the LLM
learn to solve the evaluation or summarization task, and second, to train it to
identify the minimal attention spans required for each step of the task. As a
result, the fine-tuned model is able to convert these self-identified minimal
attention spans into sparse attention masks on the fly during inference. We
develop a custom CUDA kernel that takes advantage of the reduced attention
context. We demonstrate that using this custom CUDA kernel improves the throughput
of LLM inference by 28%. Our work presents an end-to-end demonstration showing
that training LLMs to self-select their attention spans speeds up
autoregressive inference in solving real-world tasks.
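
To make the idea of converting attention spans into a sparse mask concrete, here is a minimal sketch in plain Python. The span format (one half-open `(start, end)` range per token) and the function names are assumptions for illustration; the abstract does not specify how the fine-tuned model encodes its spans, and the actual speedup comes from a custom CUDA kernel, not from a mask like this one.

```python
def causal_mask(n):
    """Dense causal mask: token i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def span_mask(n, spans):
    """Sparse mask built from per-token spans.

    spans[i] is a hypothetical half-open range (start, end) of positions that
    token i needs. The range is clipped to stay causal, and every token is
    always allowed to attend to itself.
    """
    mask = [[False] * n for _ in range(n)]
    for i, (start, end) in enumerate(spans):
        for j in range(start, min(end, i + 1)):
            mask[i][j] = True
        mask[i][i] = True
    return mask


def attended(mask):
    """Count the query-key pairs the attention kernel has to process."""
    return sum(sum(row) for row in mask)


n = 6
spans = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
dense_pairs = attended(causal_mask(n))
sparse_pairs = attended(span_mask(n, spans))
print(dense_pairs, sparse_pairs)
```

In this toy case each token keeps a window of at most two positions, so the number of attended pairs grows linearly with sequence length instead of quadratically.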

Training large language models to self-select their attention spans speeds up autoregressive inference on real-world tasks.

Self-Selected Attention Span for Accelerating Large Language Model Inference

Large Language Models (LLMs) with billions of parameters have drastically
transformed AI applications. However, their demanding computation during
inference has raised significant challenges for deployment on
resource-constrained devices. Despite recent trends favoring alternative
activation functions such as GELU or SiLU, known for increased computation,
this study strongly advocates for reinstating ReLU activation in LLMs. We
demonstrate that using the ReLU activation function has a negligible impact on
convergence and performance while significantly reducing computation and weight
transfer. This reduction is particularly valuable during the memory-bound
inference step, where efficiency is paramount. Exploring sparsity patterns in
ReLU-based LLMs, we find that activated neurons are reused when generating new
tokens. Leveraging these insights, we propose practical strategies that reduce
LLM inference computation by up to three times using ReLU activations, with
minimal performance trade-offs.
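
The reason ReLU reduces both computation and weight transfer can be seen in a toy feed-forward layer: any hidden neuron whose activation is exactly zero contributes nothing to the output, so its row of the down-projection matrix never needs to be read. The sketch below is an illustration under that assumption, with hypothetical names and made-up numbers; it is not the authors' implementation.

```python
def relu(xs):
    """ReLU produces exact zeros for non-positive inputs."""
    return [max(0.0, x) for x in xs]


def down_project_dense(h, w_down):
    """Dense down-projection; w_down[k] is the weight row of hidden neuron k."""
    out_dim = len(w_down[0])
    return [sum(h[k] * w_down[k][d] for k in range(len(h))) for d in range(out_dim)]


def down_project_sparse(h, w_down):
    """Same result, but rows belonging to inactive neurons are skipped."""
    out = [0.0] * len(w_down[0])
    rows_read = 0
    for k, a in enumerate(h):
        if a == 0.0:
            continue  # inactive neuron: no multiply, no weight load
        rows_read += 1
        for d, w in enumerate(w_down[k]):
            out[d] += a * w
    return out, rows_read


pre_activation = [-1.0, 2.0, -0.5, 0.0, 3.0, -2.0]
w_down = [[1.0, 0.0], [0.5, 1.0], [2.0, 2.0], [1.0, 1.0], [1.0, -1.0], [3.0, 3.0]]

h = relu(pre_activation)
dense_out = down_project_dense(h, w_down)
sparse_out, rows_read = down_project_sparse(h, w_down)
print(dense_out, sparse_out, rows_read, len(w_down))
```

Here only two of six weight rows are touched. Smooth activations such as GELU or SiLU rarely output exact zeros, so this skip is not available without an additional approximation.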

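This paper studies the challenges of LLM inference computation on resource-constrained devices and how to address them. By reinstating the ReLU activation function and exploring its sparsity patterns, the authors propose practical strategies that reduce inference computation by up to three times.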

ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

As deep learning (DL) is rapidly being pushed to edge computing, researchers
have invented various ways to make inference computation more efficient on
mobile/IoT devices, such as network pruning and parameter compression.
Quantization, one of the key approaches, can effectively offload the GPU and
makes it possible to deploy DL on fixed-point pipelines. Unfortunately, not all
existing network designs are friendly to quantization. For example, although the
popular lightweight MobileNetV1 successfully reduces parameter size and
computation latency with separable convolution, our experiments show that its
quantized models have a large accuracy gap relative to its floating-point
models. To resolve this, we analyzed the root cause of the quantization loss and
proposed a quantization-friendly separable convolution architecture. Evaluated
on the image classification task with the ImageNet2012 dataset, our modified
MobileNetV1 model achieves an 8-bit inference top-1 accuracy of 68.03%, almost
closing the gap to the float pipeline.
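
As background on why some networks quantize poorly, the sketch below shows generic uniform 8-bit quantization of a list of values and how a single outlier stretches the quantization range and inflates the error on everything else. This is one common source of quantization loss, used here purely as an illustration; the abstract does not state the root cause the authors identified, and the function names and data are made up.

```python
def quantize_dequantize(values, num_bits=8):
    """Uniform affine quantization over [min, max], followed by dequantization."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Snap each value to the nearest of the 2**num_bits grid points.
    return [lo + round((v - lo) / scale) * scale for v in values]


def max_abs_error(original, restored):
    """Worst-case round-trip error over paired values."""
    return max(abs(x - y) for x, y in zip(original, restored))


# Values spread over [-1, 1], then the same values plus one large outlier.
narrow = [i / 100 for i in range(-100, 101)]
wide = narrow + [50.0]

err_narrow = max_abs_error(narrow, quantize_dequantize(narrow))
# Compare only the shared values, so the outlier itself is excluded.
err_wide = max_abs_error(narrow, quantize_dequantize(wide)[: len(narrow)])
print(err_narrow, err_wide)
```

With the outlier present, the step size grows from about 0.008 to 0.2, so the values in [-1, 1] are represented far more coarsely even though they did not change.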

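This paper analyzes the accuracy loss caused by quantizing MobileNetV1 and proposes a quantization-friendly separable convolution architecture. On the ImageNet2012 dataset, the modified MobileNetV1 model achieves 68.03% top-1 accuracy with 8-bit inference, nearly matching its floating-point counterpart.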