Large Language Models (LLMs) exhibit positional bias, struggling to utilize
information from the middle or end of long contexts. Our study explores LLMs'
long-context reasoning by probing their hidden representations. We find that
while LLMs encode the position of target information, they often fail to
leverage this in generating accurate responses. This reveals a disconnect
between information retrieval and utilization, a "know but don't tell"
phenomenon. We further analyze the relationship between extraction time and
final accuracy, offering insights into the underlying mechanics of transformer
models.
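
To make the idea of probing hidden representations concrete, the following is a minimal sketch of a linear probe. It is an illustration under stated assumptions, not the authors' setup: the "hidden states" are synthetic vectors in which the position of the target information is encoded along one axis, the probe is a plain softmax regression, and all names (`synthetic_hidden_state`, `train_probe`, `NUM_POSITIONS`) are hypothetical.

```python
import math
import random

random.seed(0)
DIM, NUM_POSITIONS = 8, 3  # assumed toy sizes, not taken from the paper


def synthetic_hidden_state(position):
    """Stand-in for a real hidden state: Gaussian noise plus a signal on one axis."""
    vec = [random.gauss(0.0, 0.3) for _ in range(DIM)]
    vec[position] += 2.0
    return vec


def make_split(n_per_class):
    """Build a shuffled list of (hidden_state, position_label) pairs."""
    data = [(synthetic_hidden_state(p), p)
            for p in range(NUM_POSITIONS) for _ in range(n_per_class)]
    random.shuffle(data)
    return data


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(weights, biases, x):
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_probe(train, epochs=30, lr=0.1):
    """Fit a softmax-regression probe with per-example gradient steps."""
    weights = [[0.0] * DIM for _ in range(NUM_POSITIONS)]
    biases = [0.0] * NUM_POSITIONS
    for _ in range(epochs):
        for x, y in train:
            probs = softmax(logits_for(weights, biases, x))
            for k in range(NUM_POSITIONS):
                grad = probs[k] - (1.0 if k == y else 0.0)  # cross-entropy gradient
                biases[k] -= lr * grad
                for i in range(DIM):
                    weights[k][i] -= lr * grad * x[i]
    return weights, biases


def predict(weights, biases, x):
    scores = logits_for(weights, biases, x)
    return scores.index(max(scores))


train, test = make_split(50), make_split(30)
weights, biases = train_probe(train)
accuracy = sum(predict(weights, biases, x) == y for x, y in test) / len(test)
print(f"probe accuracy: {accuracy:.2f}")
```

High probe accuracy on held-out vectors indicates that the position is linearly decodable from the representation. With a real model, the synthetic vectors would be replaced by hidden states extracted from a chosen layer, and the probe's accuracy could be compared against the accuracy of the model's generated answers.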

Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell

Post-training quantization reduces the computational demand of Large Language
Models (LLMs) but can weaken some of their capabilities. Since LLM abilities
emerge with scale, smaller LLMs are more sensitive to quantization. In this
paper, we explore how quantization affects smaller LLMs' ability to perform
retrieval-augmented generation (RAG), specifically in longer contexts. We chose
personalization for evaluation because it is a challenging domain for RAG,
requiring long-context reasoning over multiple documents. We
compare the original FP16 and the quantized INT4 performance of multiple 7B and
8B LLMs on two tasks while progressively increasing the number of retrieved
documents to test how quantized models fare against longer contexts. To better
understand the effect of retrieval, we evaluate three retrieval models in our
experiments. Our findings reveal that if a 7B LLM performs the task well,
quantization does not impair its performance and long-context reasoning
capabilities. We conclude that it is possible to utilize RAG with quantized
smaller LLMs.
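
To illustrate what moving from FP16 to INT4 does to individual weights, here is a minimal sketch of symmetric, per-tensor, round-to-nearest quantization to the signed 4-bit range. This scheme is an assumption chosen for simplicity; the abstract does not state which INT4 method was used, and the function names and sample weights below are hypothetical.

```python
def quantize_int4(weights):
    """Map floats to integer codes in [-8, 7] using one shared scale (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.30, 0.07, 0.95, -0.61]  # made-up example values
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}, max round-trip error: {max_error:.4f}")
```

Each weight is stored as one of 16 integer codes plus a shared scale, so the round-trip error per weight is at most half the scale. Whether errors of this size affect downstream RAG quality is the empirical question the paper examines.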

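By evaluating how quantization affects LLMs of different sizes on retrieval-augmented generation in long-context settings, the study finds that for smaller LLMs that already perform the task well, quantization does not weaken their long-context reasoning ability, demonstrating that retrieval-augmented generation with quantized smaller LLMs is feasible.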