Large language models (LLMs) have shown nearly saturated performance on many
natural language processing (NLP) tasks. As a result, it is natural for people
to believe that LLMs have also mastered abilities such as time understanding
and reasoning. However, the temporal sensitivity of LLMs has received
insufficient attention. To fill this gap, this paper constructs Multiple
Sensitive Factors Time QA (MenatQA), which encompasses three temporal factors
(scope factor, order factor, counterfactual factor) with a total of 2,853 samples
for evaluating the time comprehension and reasoning abilities of LLMs. This
paper tests current mainstream LLMs with different parameter sizes, ranging
from billions to hundreds of billions. The results show that most LLMs fall behind
smaller temporal reasoning models to varying degrees on these factors.
Specifically, LLMs show significant vulnerability to temporal biases and depend
heavily on the temporal information provided in questions. Furthermore, this
paper undertakes a preliminary investigation into potential improvement
strategies by devising specific prompts and leveraging external tools. These
approaches serve as valuable baselines or references for future research
endeavors.

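This paper constructs MenatQA to evaluate the time comprehension and reasoning abilities of large language models (LLMs), and tests mainstream LLMs of different parameter sizes. The results show that most LLMs fall behind smaller temporal reasoning models to varying degrees on the temporal factors, are highly vulnerable to temporal biases, and depend heavily on the temporal information provided in questions. In addition, this paper explores potential strategies for improving LLMs through specific prompts and external tools, providing valuable baselines or references for future research.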