This paper presents the Block Transformer architecture, which applies
hierarchical global-to-local modeling to autoregressive transformers to
mitigate the inference bottlenecks of self-attention. To apply self-attention,
the key-value (KV) cache of all previous sequences must be retrieved from
memory at every decoding step. As a result, KV cache IO becomes a significant
bottleneck in batch inference. We notice that these costs stem from applying
self-attention to the global context, so we isolate the expensive
bottlenecks of global modeling to lower layers and apply fast local modeling in
upper layers. To mitigate the remaining costs in the lower layers, we aggregate
input tokens into fixed-size blocks and then apply self-attention at this
coarse level. Context information is aggregated into a single embedding to
enable upper layers to decode the next block of tokens, without global
attention. Free of global attention bottlenecks, the upper layers can fully
utilize the compute hardware to maximize inference throughput. By leveraging
global and local modules, the Block Transformer architecture demonstrates
10-20x gains in inference throughput compared to vanilla transformers with
equivalent perplexity. Our work introduces a new approach to optimize language
model inference through a novel application of global-to-local modeling. Code is
available at this https URL
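
To make the KV cache argument above concrete, the following is a minimal counting sketch, not the authors' implementation. It only tallies how many cached key-value entries are read per decoded token. The function names, the even split of layers between the lower (global) and upper (local) stacks, and the choice to count one cache entry per block in the lower layers are all assumptions made for illustration.

```python
def kv_reads_vanilla(context_len: int, num_layers: int) -> int:
    """KV entries read to decode one token when every layer attends globally."""
    return num_layers * context_len


def kv_reads_block(context_len: int, num_layers: int, block_len: int) -> float:
    """Amortized KV entries read per decoded token in a two-stage model.

    Assumption (not from the abstract): layers are split evenly between the
    lower block-level stack and the upper token-level stack.
    """
    lower_layers = num_layers // 2
    upper_layers = num_layers - lower_layers

    # Lower layers run once per block and attend over one entry per block,
    # so the per-token cost is divided by block_len a second time.
    num_blocks = context_len // block_len
    global_reads = lower_layers * num_blocks / block_len

    # Upper layers attend only within the current block.
    local_reads = upper_layers * block_len

    return global_reads + local_reads


if __name__ == "__main__":
    print(kv_reads_vanilla(2048, 24))
    print(kv_reads_block(2048, 24, 4))
```

The count covers cache reads only. It is not a throughput prediction, since real throughput also depends on parameter IO, compute, and batch size.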

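This work proposes the Block Transformer architecture, which adopts hierarchical global-to-local modeling to mitigate the inference bottlenecks of self-attention. Global modeling is confined to the lower layers and fast local modeling is applied in the upper layers to reduce the costs tied to the global context, and input tokens are aggregated into blocks to lower the remaining cost in the lower layers. Free of global attention bottlenecks, the upper layers can fully utilize the compute hardware to maximize inference throughput, which optimizes language model inference.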

Block Transformer: Global-to-Local Language Modeling for Fast Inference

In our study, we present bifurcated attention, a method developed for
language model inference in single-context batch sampling settings. This
approach aims to reduce redundant memory IO costs, a significant factor in
latency for high batch sizes and long context lengths. Bifurcated attention
achieves this by dividing the attention mechanism during incremental decoding
into two distinct GEMM operations, one over the KV cache from prefill and one
over the KV cache from the decoding process. This method ensures precise
computation and maintains the usual computational load (FLOPs) of standard
attention mechanisms, but with reduced memory IO. Bifurcated attention is also
compatible with the multi-query attention mechanism, known for reduced memory
IO for the KV cache, further enabling higher batch sizes and context lengths.
The resulting efficiency leads to lower
latency, improving suitability for real-time applications, e.g., enabling
massively-parallel answer generation without substantially increasing latency,
enhancing performance when integrated with postprocessing techniques such as
reranking.
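
The split described above can be sketched in plain Python, as an illustration under stated assumptions rather than the authors' kernel. All samples in the batch share one context (prefill) KV cache, while each sample has its own decoded KV cache. Attention logits are computed in two parts and joined under a single softmax, which gives the same result as attending over the concatenated cache. All function names and shapes here are hypothetical. Pure Python lists cannot show the memory IO saving itself; the sketch only checks that the two-part computation is exact.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, rows, dim):
    """Sum of rows scaled by weights (zero vector if rows is empty)."""
    return [sum(w * row[j] for w, row in zip(weights, rows)) for j in range(dim)]


def standard_attention(queries, k_ctx, v_ctx, k_dec, v_dec):
    """Baseline: every sample attends over its own [context; decoded] copy."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    v_dim = len(v_ctx[0])
    outputs = []
    for q, k_d, v_d in zip(queries, k_dec, v_dec):
        keys, values = k_ctx + k_d, v_ctx + v_d  # context re-read per sample
        weights = softmax([scale * dot(q, k) for k in keys])
        outputs.append(weighted_sum(weights, values, v_dim))
    return outputs


def bifurcated_attention(queries, k_ctx, v_ctx, k_dec, v_dec):
    """Two-part attention: shared context part plus per-sample decoded part."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    v_dim = len(v_ctx[0])
    n_ctx = len(k_ctx)
    # Part 1: all queries against the single shared context keys.
    ctx_logits = [[scale * dot(q, k) for k in k_ctx] for q in queries]
    # Part 2: each query against its own decoded keys.
    dec_logits = [[scale * dot(q, k) for k in k_d]
                  for q, k_d in zip(queries, k_dec)]
    outputs = []
    for c_log, d_log, v_d in zip(ctx_logits, dec_logits, v_dec):
        weights = softmax(c_log + d_log)  # one softmax over both parts
        out_ctx = weighted_sum(weights[:n_ctx], v_ctx, v_dim)
        out_dec = weighted_sum(weights[n_ctx:], v_d, v_dim)
        outputs.append([a + b for a, b in zip(out_ctx, out_dec)])
    return outputs


def random_matrix(rng, rows, cols):
    """Hypothetical helper that builds random test data."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    batch, ctx_len, dec_len, dim = 3, 5, 2, 4
    queries = random_matrix(rng, batch, dim)
    k_ctx, v_ctx = random_matrix(rng, ctx_len, dim), random_matrix(rng, ctx_len, dim)
    k_dec = [random_matrix(rng, dec_len, dim) for _ in range(batch)]
    v_dec = [random_matrix(rng, dec_len, dim) for _ in range(batch)]
    ref = standard_attention(queries, k_ctx, v_ctx, k_dec, v_dec)
    out = bifurcated_attention(queries, k_ctx, v_ctx, k_dec, v_dec)
    print(max(abs(x - y) for r, o in zip(ref, out) for x, y in zip(r, o)))
```

In a batched tensor implementation, the first part would be a single matrix multiply against the unbatched context keys, so the shared context would not need to be loaded once per sample.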

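Our study presents bifurcated attention, a method for language model inference in single-context batch sampling settings. The method divides the attention mechanism during incremental decoding into two distinct GEMM operations, one focused on the KV cache from prefill and one on the decoding process, to reduce redundant memory IO costs. It achieves exact computation and maintains the usual computational load (FLOPs) of standard attention mechanisms, but with reduced memory IO. Bifurcated attention is also compatible with the multi-query attention mechanism, known for reducing memory IO, which further supports larger batch sizes and context lengths. The resulting efficiency leads to lower latency and improves suitability for real-time applications, for example enabling parallel answer generation without substantially increasing latency, and it improves performance when combined with postprocessing techniques such as reranking.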

Bifurcated Attention for Single-Context Large-Batch Sampling

Current prompting approaches for language model inference mainly rely on a
large language model's (LLM) autonomous exploration of reasoning paths, which
confronts an inevitable retracing operation when erroneous routes are
encountered. This is followed by the pursuit of alternative reasoning paths.
However, humans are adept at abstracting optimal solutions from problems,
thereby facilitating swift and precise reasoning when resolving similar
problems. In light of this, we delve into the potential of harnessing expert
knowledge to enhance
problem-solving within LLMs. We introduce a novel paradigm, the State Machine
of Thought (SMoT), which employs predefined state machines to furnish LLMs with
efficient reasoning paths, thereby eliminating fruitless exploration.
Furthermore, we propose a multi-agent mechanism that assigns different
objectives to agents, aiming to enhance the accuracy of SMoT reasoning. The
experimental results, derived from an array reasoning task, reveal that SMoT
realizes an extraordinary accuracy of 95%, surpassing the performance of the
state-of-the-art baselines.
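
The idea of a predefined state machine supplying the reasoning path can be sketched as follows. This is a toy illustration, not the SMoT implementation. The task (finding the largest element of an array), the state names, and the handler functions are all hypothetical. In a real system each handler would presumably wrap an LLM call for that sub-step, whereas here they are deterministic Python stand-ins.

```python
from typing import Any, Callable, Dict, List

Memory = Dict[str, Any]
Handler = Callable[[Memory], str]  # performs one sub-step, returns next state


def init(mem: Memory) -> str:
    """Start from the first element as the current best."""
    mem["index"] = 0
    mem["best"] = mem["values"][0]
    return "ADVANCE"


def advance(mem: Memory) -> str:
    """Move to the next element, or finish when the array is exhausted."""
    mem["index"] += 1
    return "COMPARE" if mem["index"] < len(mem["values"]) else "DONE"


def compare(mem: Memory) -> str:
    """Keep the larger of the current best and the current element."""
    mem["best"] = max(mem["best"], mem["values"][mem["index"]])
    return "ADVANCE"


# The expert-defined transition structure: no backtracking is ever needed.
HANDLERS: Dict[str, Handler] = {
    "INIT": init,
    "ADVANCE": advance,
    "COMPARE": compare,
}


def run_state_machine(handlers: Dict[str, Handler], start: str,
                      memory: Memory, max_steps: int = 1000) -> List[str]:
    """Follow the predefined transitions until DONE; return the visited states."""
    state = start
    trace = [state]
    for _ in range(max_steps):
        if state == "DONE":
            return trace
        state = handlers[state](memory)
        trace.append(state)
    raise RuntimeError("state machine did not terminate")


if __name__ == "__main__":
    memory = {"values": [3, -1, 7, 2]}
    print(run_state_machine(HANDLERS, "INIT", memory))
    print(memory["best"])
```

Because the path is fixed in advance, the solver never explores a route it later has to retrace, which is the property the abstract attributes to predefined state machines.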

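Using expert knowledge to enhance the problem-solving ability of language models, this work proposes a new paradigm, SMoT, which uses predefined state machines to provide language models with efficient reasoning paths and eliminate fruitless exploration. Experimental results show that SMoT achieves an outstanding accuracy of 95% on a reasoning task, surpassing the state-of-the-art baselines.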