This paper introduces a novel approach to enhance the capabilities of Large
Language Models (LLMs) in processing and understanding extensive text
sequences, a critical aspect in applications requiring deep comprehension and
synthesis of large volumes of information. Recognizing the inherent challenges
in extending the context window for LLMs, primarily built on Transformer
architecture, we propose a new model architecture, referred to as Zebra. This
architecture efficiently manages the quadratic time and memory complexity
issues associated with full attention in the Transformer by employing grouped
local-global attention layers. Our model, akin to a zebra's alternating
stripes, balances local and global attention layers, significantly reducing
computational requirements and memory consumption. Comprehensive experiments,
including pretraining from scratch, continuation of long context adaptation
training, and long instruction tuning, are conducted to evaluate the Zebra's
performance. The results show that Zebra achieves comparable or superior
performance on both short and long sequence benchmarks, while also enhancing
training and inference efficiency.

介绍了一种增强大型语言模型在处理和理解大量文本序列方面能力的新方法，通过提出一种名为斑马的新型模型架构，有效地处理了 Transformer 中全注意力所带来的二次时间和内存复杂度问题，通过使用分组的局部 - 全局注意力层平衡局部和全局注意力，显著降低了计算需求和内存消耗，同时提高了训练和推理的效率。