Large Language Models (LLMs) with hundreds of billions of parameters have
transformed the field of machine learning. However, serving these models at
inference time is both compute and memory intensive: a single request can
require multiple GPUs and tens of gigabytes of memory. Multi-Head Attention is
one of the key components of LLMs and can account for over 50% of an LLM's
memory and compute requirements. We observe that there is a high amount of
redundancy across heads in which tokens they attend to. Based on this
insight, we propose Clustered Head Attention (CHAI). CHAI combines heads with a
high amount of correlation for self-attention at runtime, thus reducing both
memory and compute. In our experiments, we show that CHAI is able to reduce the
memory requirements for storing the K,V cache by up to 21.4% and inference-time
latency by up to 1.73x without any fine-tuning required. CHAI achieves this
with a maximum 3.2% deviation in accuracy across 3 different models (i.e.,
OPT-66B, LLAMA-7B, LLAMA-33B) and 5 different evaluation datasets.

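Summary: Building on the high redundancy observed across the attention heads of large language models, this work proposes Clustered Head Attention (CHAI), which combines highly correlated heads at runtime to lower memory and compute requirements, reducing K,V cache memory by up to 21.4% and inference latency by up to 1.73x.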

CHAI: Clustered Head Attention for Efficient LLM Inference
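
The abstract says CHAI combines heads whose attention is highly correlated, but it does not describe the clustering procedure. The following is a minimal sketch of the general idea only, assuming a simple greedy grouping by Pearson correlation; the function name `cluster_heads`, the `threshold` parameter, and the toy attention maps are all hypothetical and are not taken from the paper.

```python
import numpy as np


def cluster_heads(attn, threshold=0.9):
    """Greedily group heads whose flattened attention maps are highly correlated.

    attn: array of shape (num_heads, seq_len, seq_len) of attention weights.
    Returns a list of clusters (lists of head indices); the first index of
    each cluster acts as that cluster's representative head.
    NOTE: greedy thresholding is an illustrative assumption, not CHAI's method.
    """
    num_heads = attn.shape[0]
    flat = attn.reshape(num_heads, -1)
    corr = np.corrcoef(flat)  # (num_heads, num_heads) Pearson correlations
    clusters = []
    for h in range(num_heads):
        for cluster in clusters:
            # Join the first cluster whose representative correlates strongly.
            if corr[h, cluster[0]] >= threshold:
                cluster.append(h)
                break
        else:
            clusters.append([h])
    return clusters


# Toy attention maps for 4 heads over 4 tokens (hypothetical data).
diag = np.eye(4)                 # every token attends to itself
first = np.zeros((4, 4))
first[:, 0] = 1.0                # every token attends to the first token
uniform = np.full((4, 4), 0.25)

attn = np.stack([
    diag,
    first,
    0.9 * diag + 0.1 * uniform,   # a smoothed copy of head 0
    0.9 * first + 0.1 * uniform,  # a smoothed copy of head 1
])

clusters = cluster_heads(attn)
# Fraction of heads whose attention scores would still need to be computed
# if each cluster reused its representative's scores.
kept_fraction = len(clusters) / attn.shape[0]
```

In this toy setup, heads 0 and 2 fall into one cluster and heads 1 and 3 into another, so only half of the heads would need their own attention scores. How such sharing translates into the reported K,V cache and latency savings depends on details the abstract does not give.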

Real-time processing of time series signals is a critical issue for many
real-life applications. Real-time processing is especially
important in the audio domain, as human perception of sound is sensitive to any
kind of disturbance in perceived signals, especially lag between the auditory
and visual modalities. The rise of deep learning (DL) models has complicated the
landscape of signal processing. Although they often have superior quality
compared to standard DSP methods, this advantage is diminished by higher
latency. In this work we propose a novel method for minimizing inference-time
latency and memory consumption, called Short-Term Memory Convolution
(STMC), and its transposed counterpart. The main advantage of STMC is its low
latency comparable to long short-term memory (LSTM) networks. Furthermore, the
training of STMC-based models is faster and more stable as the method is based
solely on convolutional neural networks (CNNs). In this study we demonstrate an
application of this solution to a U-Net model for a speech separation task and
a GhostNet model for an acoustic scene classification (ASC) task. For speech
separation we achieved a 5-fold reduction in inference time and a 2-fold
reduction in latency without affecting the output quality. Inference
for the ASC task was up to 4 times faster while preserving the original accuracy.

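Summary: This paper proposes Short-Term Memory Convolution (STMC), a method based on convolutional neural networks for real-time processing in the audio domain. STMC achieves low latency comparable to LSTM networks while training faster and more stably, and it speeds up speech separation and acoustic scene classification without degrading output quality or accuracy.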
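
The abstract does not spell out how STMC works internally. As a loosely related illustration only, the sketch below shows one generic way to run a causal 1-D convolution sample by sample while keeping a short buffer of past inputs, so that each new sample costs one dot product. The class `StreamingConv1D` and the helper `offline_causal_conv` are hypothetical names, and this buffering scheme is an assumption, not the authors' design.

```python
import numpy as np


class StreamingConv1D:
    """Causal 1-D convolution evaluated one sample at a time.

    Keeps the last (kernel_size - 1) inputs in a small buffer, so each new
    sample needs a single dot product rather than a pass over the full signal.
    NOTE: a generic streaming convolution, not the STMC layer from the paper.
    """

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # Zero history is equivalent to zero padding on the left.
        self.buffer = np.zeros(len(self.kernel) - 1)

    def step(self, x):
        window = np.append(self.buffer, x)     # ordered oldest ... newest
        y = float(window @ self.kernel[::-1])  # kernel[0] weights the newest sample
        self.buffer = window[1:]               # drop the oldest sample
        return y


def offline_causal_conv(signal, kernel):
    """Reference result: full convolution truncated to the signal length."""
    return np.convolve(signal, kernel)[: len(signal)]


signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = [0.5, 0.3, 0.2]

conv = StreamingConv1D(kernel)
streamed = np.array([conv.step(x) for x in signal])
reference = offline_causal_conv(signal, kernel)
```

Here `streamed` and `reference` agree, which shows that the buffered, per-sample evaluation reproduces the offline causal convolution. The memory held between steps is just `kernel_size - 1` samples.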