The rapid advancement of Large Language Models (LLMs) highlights the urgent
need for evolving evaluation methodologies that keep pace with improvements in
language comprehension and information processing. However, traditional
benchmarks, which are often static, fail to capture the continually changing
information landscape, leading to a disparity between the perceived and actual
effectiveness of LLMs in ever-changing real-world scenarios. Furthermore, these
benchmarks do not adequately measure the models' capabilities over a broader
temporal range or their adaptability over time. We examine current LLMs in
terms of temporal generalization and bias, revealing that various temporal
biases emerge in both language likelihood and prognostic prediction. This
finding cautions LLM practitioners to pay closer attention to mitigating
temporal biases. We also propose an evaluation framework, Freshbench, for
dynamically generating benchmarks from the most recent real-world
prognostication predictions. Our code is available at this https URL. The
dataset will be released soon.

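The development of large language models urgently calls for evaluation methods that keep pace with improvements in language comprehension and information processing. We examine current LLMs in terms of temporal generalization and bias and find that they exhibit a variety of temporal biases. We propose an evaluation framework, Freshbench, for dynamically generating benchmarks from the most recent real-world prognostication predictions.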

Evaluating the performance of LLMs on temporal generalization

Evaluating LLMs at Temporal Generalization
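
The temporal bias in language likelihood mentioned in the abstract above can be illustrated with a small sketch. The abstract does not say which metric is used, so bits per character is an assumption here, and the character unigram model is only a toy stand-in for an LLM. The function names (train_char_unigram, bits_per_char, likelihood_by_period) are hypothetical and do not come from the Freshbench code.

```python
import math
from collections import Counter


def train_char_unigram(corpus, alpha=1.0):
    """Toy stand-in for an LLM: a smoothed character unigram scorer.

    Assumption: a real study would query an LLM for token log-probabilities.
    Every unseen character receives the same small smoothed score.
    """
    counts = Counter(corpus)
    denom = sum(counts.values()) + alpha * (len(counts) + 1)

    def log2_prob(ch):
        return math.log2((counts.get(ch, 0) + alpha) / denom)

    return log2_prob


def bits_per_char(log2_prob, text):
    """Average negative log2-likelihood per character (lower = better fit)."""
    if not text:
        raise ValueError("text must be non-empty")
    return -sum(log2_prob(ch) for ch in text) / len(text)


def likelihood_by_period(log2_prob, texts_by_period):
    """Score the texts of each time period with the same frozen model."""
    return {
        period: bits_per_char(log2_prob, " ".join(texts))
        for period, texts in texts_by_period.items()
    }


# Invented toy data: the "model" only ever saw the earlier period.
scorer = train_char_unigram("aaaa bbbb")
scores = likelihood_by_period(
    scorer,
    {"seen_period": ["aabb"], "later_period": ["zzzz"]},
)
```

A higher score for the later period than for the seen period would indicate that the frozen model fits newer text less well, which is the kind of gap a dynamically refreshed benchmark is meant to expose.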

Climate models are biased with respect to real-world observations and usually
need to be calibrated prior to impact studies. The suite of statistical methods
that enable such calibrations is called bias correction (BC). However, current
BC methods struggle to adjust for temporal biases, because they disregard the
dependence between consecutive time-points. As a result, climate statistics
with long-range temporal properties, such as heatwave duration and frequency,
cannot be corrected accurately, making it more difficult to produce reliable
impact studies on such climate statistics. In this paper, we offer a novel BC
methodology to correct for temporal biases. This is made possible by i)
re-thinking BC as a probability model rather than an algorithmic procedure, and
ii) adapting state-of-the-art machine-learning (ML) probabilistic attention
models to fit the BC task. With a case study of heatwave duration statistics in
Abuja, Nigeria, and Tokyo, Japan, we show striking results compared to current
climate model outputs and alternative BC methods.

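By reformulating bias correction (BC) as a probability model rather than an algorithmic procedure, and by adapting state-of-the-art machine-learning (ML) probabilistic attention models to the BC task, we offer a novel BC method that corrects temporal biases and so supports more reliable impact studies on climate statistics.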

Temporal bias correction using a machine learning attention model

A Temporal Bias Correction using a Machine Learning Attention model
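
To see why a method that disregards the dependence between consecutive time-points cannot fix heatwave duration, consider the sketch below. It uses empirical quantile mapping as a representative pointwise BC method; the abstract does not name the baseline methods, so that choice is an assumption, and the temperature series are invented. This is not the attention-based method the paper proposes.

```python
import math
from bisect import bisect_right


def quantile_map(model_hist, obs_hist, model_values):
    """Empirical quantile mapping: send each model value to the observed
    value at the same empirical quantile. Each value is corrected on its
    own, so the time ordering of the series is never consulted."""
    m_sorted = sorted(model_hist)
    o_sorted = sorted(obs_hist)
    n_m, n_o = len(m_sorted), len(o_sorted)
    corrected = []
    for v in model_values:
        # Empirical CDF of v under the model's historical distribution.
        q = bisect_right(m_sorted, v) / n_m
        # Nearest-rank inverse CDF of the observations.
        idx = min(n_o - 1, max(0, math.ceil(q * n_o) - 1))
        corrected.append(o_sorted[idx])
    return corrected


def run_lengths(series, threshold):
    """Lengths of consecutive runs strictly above the threshold."""
    runs, current = [], 0
    for v in series:
        if v > threshold:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# Invented daily maxima: the model alternates hot and cool days,
# while the observations contain one four-day hot spell.
model = [30, 20, 31, 21, 32, 22, 33, 23]
obs = [35, 36, 37, 38, 25, 26, 27, 28]

corrected = quantile_map(model, obs, model)
model_runs = run_lengths(corrected, 30)
obs_runs = run_lengths(obs, 30)
```

In this toy case the corrected series has exactly the observed value distribution, yet its hot days still alternate with cool ones, so the run lengths in model_runs differ from those in obs_runs. The mapping is monotone, so it inherits the model's ordering of days.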

Large language models (LLMs) have shown nearly saturated performance on many
natural language processing (NLP) tasks. As a result, it is natural for people
to believe that LLMs have also mastered abilities such as time understanding
and reasoning. However, research on the temporal sensitivity of LLMs has
received insufficient attention. To fill this gap, this paper constructs Multiple
Sensitive Factors Time QA (MenatQA), which encompasses three temporal factors
(scope factor, order factor, counterfactual factor) with a total of 2,853 samples
for evaluating the time comprehension and reasoning abilities of LLMs. This
paper tests current mainstream LLMs with different parameter sizes, ranging
from billions to hundreds of billions. The results show that most LLMs fall
behind smaller temporal reasoning models to varying degrees on these factors.
Specifically, LLMs show a significant vulnerability to temporal biases and depend
heavily on the temporal information provided in questions. Furthermore, this
paper undertakes a preliminary investigation into potential improvement
strategies by devising specific prompts and leveraging external tools. These
approaches serve as valuable baselines or references for future research
endeavors.

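This paper constructs MenatQA to evaluate the time comprehension and reasoning abilities of large language models (LLMs), and tests mainstream LLMs of different parameter sizes. The results show that most LLMs fall behind smaller temporal reasoning models to varying degrees on the temporal factors, are significantly vulnerable to temporal biases, and depend heavily on the temporal information provided in questions. The paper also explores potential improvement strategies based on specific prompts and external tools, providing valuable baselines or references for future research.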
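
The three temporal factors named in the abstract (scope, order, counterfactual) can be made concrete with a small sketch. The abstract does not describe how MenatQA builds its samples, so everything below is an assumption for illustration: the TimedFact structure, the helper functions, and the invented facts about "Alice" are all hypothetical and are not taken from the dataset.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TimedFact:
    subject: str
    relation: str
    obj: str
    start: int  # first year the fact holds (inclusive)
    end: int    # last year the fact holds (inclusive)


def answer(facts, subject, relation, start, end):
    """Gold answers: objects whose validity interval overlaps [start, end]."""
    return sorted({
        f.obj for f in facts
        if f.subject == subject and f.relation == relation
        and f.start <= end and start <= f.end
    })


def order_variant(facts, seed=0):
    """Order factor: the same facts, presented in shuffled order."""
    shuffled = list(facts)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def counterfactual_variant(facts, target, new_end):
    """Counterfactual factor: hypothetically change when one fact ended."""
    return [replace(f, end=new_end) if f == target else f for f in facts]


# Invented context facts.
team_a = TimedFact("Alice", "played for", "TeamA", 2001, 2004)
team_b = TimedFact("Alice", "played for", "TeamB", 2005, 2010)
facts = [team_a, team_b]

base = answer(facts, "Alice", "played for", 2003, 2003)
# Scope factor: widen the time span asked about in the question.
wider_scope = answer(facts, "Alice", "played for", 2003, 2006)
# Order factor: the gold answer should not depend on presentation order.
reordered = answer(order_variant(facts), "Alice", "played for", 2003, 2003)
# Counterfactual factor: "suppose Alice stayed at TeamA until 2006".
hypothetical = answer(
    counterfactual_variant(facts, team_a, 2006),
    "Alice", "played for", 2006, 2006,
)
```

Comparing a model's answers against gold answers like these, before and after each perturbation, is one way to measure how strongly the model depends on the temporal information stated in the question.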