Building effective and efficient Transformer-based large language models (LLMs) has recently become a research focus, which requires maximizing a model's language capabilities while minimizing its training and deployment costs. Existing efforts have primarily characterized the relationships among model performance, parameter size, and data size, and have searched for the compute-optimal allocation for training LLMs. However, they overlook the impact of context length and attention head configuration (the number of query and key-value heads in grouped-query attention) on training and inference. In this paper, we systematically compare models with different parameter sizes, context lengths, and attention head configurations in terms of model performance, computational cost, and memory cost. We then extend existing scaling methods, which are based solely on parameter size and training compute, to guide the construction of cost-optimal LLMs for both training and inference. Our quantitative scaling studies show that, when processing sufficiently long sequences, a larger model with fewer attention heads can achieve a lower loss while incurring lower computational and memory costs. Our findings provide valuable insights for developing practical LLMs, especially in long-context processing scenarios. We will publicly release our code and data.
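
As a rough illustration of why head configuration matters at long context, the sketch below uses the standard back-of-the-envelope estimates for grouped-query attention: the key-value cache grows with the number of key-value heads, and the attention compute per generated token grows with the number of query heads, both linearly in context length. This is not the cost model of the paper. The function names, the two configurations, and the 2-byte value size are assumptions chosen only for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Per-sequence KV cache size in bytes.

    The leading factor 2 counts keys and values; bytes_per_value=2 assumes
    16-bit storage (an assumption, not a value from the paper).
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def attention_flops_per_token(n_layers, n_q_heads, head_dim, context_len):
    """Approximate forward FLOPs per generated token inside attention.

    Each query head computes context_len dot products with keys
    (2 * head_dim FLOPs each) and a weighted sum over context_len values
    (another 2 * head_dim FLOPs each). Projections and softmax are ignored.
    """
    return 4 * n_layers * n_q_heads * head_dim * context_len


# Two hypothetical configurations at a 128K-token context.
context_len = 131072
config_a = dict(n_layers=32, n_q_heads=32, n_kv_heads=8, head_dim=128)
config_b = dict(n_layers=32, n_q_heads=16, n_kv_heads=2, head_dim=128)

for name, cfg in (("A", config_a), ("B", config_b)):
    cache_gib = kv_cache_bytes(
        cfg["n_layers"], cfg["n_kv_heads"], cfg["head_dim"], context_len
    ) / 2**30
    flops = attention_flops_per_token(
        cfg["n_layers"], cfg["n_q_heads"], cfg["head_dim"], context_len
    )
    print(f"config {name}: KV cache {cache_gib:.0f} GiB, attention FLOPs/token {flops:.3e}")
```

Under these estimates, reducing the head counts lowers both memory and attention compute at a fixed context length, which frees budget that could instead be spent on parameters elsewhere in the model. Whether that trade actually lowers the loss is an empirical question that the estimates above do not answer.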

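This work addresses a gap in existing research on large language models (LLMs): the neglect of context length and attention head configuration when handling long contexts. We systematically compare models with different parameter sizes, context lengths, and attention head configurations, and extend existing scaling methods to guide the construction of cost-optimal LLMs. The results show that, when processing long sequences, a larger model with fewer attention heads can achieve a lower loss at lower computational and memory cost, offering useful insights for developing practical LLMs.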

Cost-Optimal Grouped-Query Attention for Long-Context Large Language Models