Large language models (LLMs) have shown promising efficacy across various
tasks, becoming powerful tools in numerous aspects of human life. However,
Transformer-based LLMs suffer performance degradation when modeling long-term
contexts because they discard some information to reduce computational overhead.
In this work, we propose a simple yet effective method to enable LLMs to take a
deep breath, encouraging them to summarize information contained within
discrete text chunks. Specifically, we segment the text into multiple chunks
and insert a special token <SR> at the end of each chunk. We then modify the
attention mask to integrate the chunk's information into the corresponding <SR>
token. This enables LLMs to interpret information not only from individual
historical tokens but also from the <SR> token, which aggregates the chunk's
semantic information. Experiments on language modeling and out-of-domain
downstream tasks validate the superiority of our approach.
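
The chunking and masking steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the chunking rule or the exact mask, so the fixed `chunk_size`, the helper names `insert_sentinels` and `build_mask`, and the rule that an <SR> token attends only to its own chunk are all assumptions made for the example.

```python
SR = "<SR>"  # sentinel token named in the abstract

def insert_sentinels(tokens, chunk_size):
    """Split tokens into fixed-size chunks (an assumption) and append SR to each.

    Returns the new token sequence and the chunk id of every position.
    """
    out, chunk_ids = [], []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        cid = start // chunk_size
        out.extend(chunk + [SR])
        chunk_ids.extend([cid] * (len(chunk) + 1))
    return out, chunk_ids

def build_mask(seq, chunk_ids):
    """mask[i][j] is True when position i may attend to position j.

    Assumed rule: ordinary tokens keep the usual causal mask, so they see all
    earlier tokens and earlier SR tokens; an SR token sees only itself and the
    tokens of its own chunk, so it summarizes that chunk alone.
    """
    n = len(seq)
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        if seq[i] == SR:
            for j in range(i + 1):
                mask[i][j] = chunk_ids[j] == chunk_ids[i]
    return mask

seq, ids = insert_sentinels(list("abcdef"), chunk_size=3)
mask = build_mask(seq, ids)
for token, row in zip(seq, mask):
    print(f"{token:>4}", "".join("x" if allowed else "." for allowed in row))
```

In a real model the boolean mask would be converted to the additive or boolean form expected by the attention implementation in use, and <SR> would be registered as a special token in the tokenizer.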

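We propose a simple yet effective method that segments the text into multiple chunks and inserts a special token <SR> at the end of each chunk, then modifies the attention mask to integrate each chunk's information into its corresponding <SR> token. This lets LLMs interpret information from individual historical tokens as well as from the <SR> tokens, which aggregate each chunk's semantic information. Experiments on language modeling and out-of-domain downstream tasks validate the superiority of our approach.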

Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens

LSTMs and other RNN variants have shown strong performance on character-level
language modeling. These models are typically trained using truncated
backpropagation through time, and it is common to assume that their success
stems from their ability to remember long-term contexts. In this paper, we show
that a deep (64-layer) transformer model with fixed context outperforms RNN
variants by a large margin, achieving state of the art on two popular
benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good
results at this depth, we show that it is important to add auxiliary losses,
both at intermediate network layers and intermediate sequence positions.
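
To make the reported metric and the idea of auxiliary losses concrete, here is a small sketch. Bits per character is the average negative base-2 log-probability the model assigns to the correct next character. The functions `sequence_loss` and `combined_loss`, the example probabilities, and the layer weights are hypothetical illustrations of "loss at intermediate positions" and "loss at intermediate layers"; the abstract does not specify the actual weighting or schedule.

```python
import math

def bits_per_character(true_char_probs):
    """Average negative log2-probability assigned to the correct next character."""
    return -sum(math.log2(p) for p in true_char_probs) / len(true_char_probs)

def sequence_loss(true_char_probs, final_only=False):
    """Loss in bits over every position, or over the final position only.

    Averaging over all positions is a simplified stand-in for adding
    prediction targets at intermediate sequence positions.
    """
    probs = true_char_probs[-1:] if final_only else true_char_probs
    return bits_per_character(probs)

def combined_loss(per_layer_probs, layer_weights):
    """Weighted sum of per-layer losses; the weights are hypothetical."""
    return sum(w * bits_per_character(p)
               for p, w in zip(per_layer_probs, layer_weights))

probs = [0.5, 0.25, 0.5, 0.25]  # made-up probabilities of the true characters
print(bits_per_character(probs))               # 1.5
print(sequence_loss(probs, final_only=True))   # 2.0
print(combined_loss([[0.5, 0.5], [0.25, 0.25]], [0.5, 1.0]))  # 2.5
```

Under this definition, a score of 1.06 on enwik8 means the model needs on average 1.06 bits to encode each character of the test text.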

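This paper shows experimentally that a deep (64-layer) transformer model, with auxiliary losses added at intermediate network layers and intermediate sequence positions, outperforms RNN variants trained with truncated backpropagation through time on the text8 and enwik8 datasets, reaching 1.13 and 1.06 bits per character, respectively.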