This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite recent advances in LLMs, the consistency of their performance across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length by using multiple versions of the same sample, each extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs' reasoning performance at input lengths much shorter than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that traditional perplexity metrics do not correlate with LLMs' performance on long-input reasoning tasks. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
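
The padding idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper name `pad_sample`, the three location labels, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration. The padding type is varied simply by passing in different filler sentences.

```python
def pad_sample(key_paragraphs, filler_sentences, target_words, location="end"):
    """Return one padded version of a sample as a single string.

    Hypothetical helper: word counts approximate token counts, and the
    location labels ("start", "between", "end") are illustrative only.
    """
    # Words already contributed by the paragraphs needed to answer the question.
    core_words = sum(len(p.split()) for p in key_paragraphs)
    budget = max(0, target_words - core_words)

    # Cycle through the filler sentences until the word budget is met.
    filler, used, i = [], 0, 0
    while used < budget and filler_sentences:
        sentence = filler_sentences[i % len(filler_sentences)]
        filler.append(sentence)
        used += len(sentence.split())
        i += 1

    if location == "start":      # padding first, key paragraphs last
        parts = filler + key_paragraphs
    elif location == "between":  # padding after the first key paragraph
        parts = key_paragraphs[:1] + filler + key_paragraphs[1:]
    elif location == "end":      # key paragraphs first, padding last
        parts = key_paragraphs + filler
    else:
        raise ValueError(f"unknown location: {location}")
    return " ".join(parts)
```

Calling this helper on one sample with several values of `target_words` and `location` yields versions that share the same question-relevant text and differ only in the surrounding padding, which is the property that lets length be studied in isolation.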

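This study explores the impact of extending input length on the capabilities of Large Language Models (LLMs). It introduces a novel QA reasoning framework focused on assessing how input length affects performance. The results show that LLMs' reasoning performance degrades notably at input lengths far below their technical maximum, and that this degradation trend appears in every version of the dataset, although at different intensities. The study also finds that traditional perplexity metrics do not correlate with LLMs' performance on long-input reasoning tasks. By analysing the results, we identify failure modes that may guide future research and could help address the limitations observed in LLMs.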

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models