Large language models (LLMs) have recently gained significant attention due
to their unparalleled ability to perform various natural language processing
tasks. These models, benefiting from their advanced natural language
understanding capabilities, have demonstrated impressive zero-shot performance.
However, the pre-training data utilized in LLMs is often confined to a specific
corpus, resulting in inherent freshness and temporal scope limitations.
Consequently, this raises concerns regarding the effectiveness of LLMs for
tasks involving temporal intents. In this study, we aim to investigate the
underlying limitations of general-purpose LLMs when deployed for tasks that
require a temporal understanding. We pay particular attention to handling
factual temporal knowledge through three popular temporal QA datasets.
Specifically, we observe low performance on detailed questions about the past
and, surprisingly, on questions about rather recent information. In manual and automatic testing,
we find multiple temporal errors and characterize the conditions under which QA
performance deteriorates. Our analysis contributes to understanding LLM
limitations and offers valuable insights into developing future models that can
better cater to the demands of temporally-oriented tasks. The code is
available\footnote{this https URL}.
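
One way to make the conditions under which QA performance deteriorates visible is to group results by the time period each question refers to and compare accuracy per period. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the record fields (`year`, `prediction`, `answers`), the simplified normalization, and the decade bucketing are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace (a deliberately simplified normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction, answers):
    """Return True if the prediction equals any gold answer after normalization."""
    return normalize(prediction) in {normalize(a) for a in answers}


def accuracy_by_decade(records):
    """Compute exact-match accuracy per decade.

    Each record is assumed (hypothetically) to be a dict with the keys
    `year` (int), `prediction` (str), and `answers` (list of str).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        decade = (rec["year"] // 10) * 10
        totals[decade] += 1
        hits[decade] += int(exact_match(rec["prediction"], rec["answers"]))
    return {decade: hits[decade] / totals[decade] for decade in sorted(totals)}
```

Comparing the resulting per-decade values side by side would show whether accuracy drops for older or for more recent periods. A finer bucketing (for example by year) may be more appropriate when the dataset is large enough.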

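This study investigates the underlying limitations of general-purpose large language models (LLMs) on tasks that require temporal understanding. Across three popular temporal QA datasets, we find that LLMs perform poorly on detailed questions about the past and about new information, and that they make multiple temporal errors. Our analysis helps in understanding the limitations of LLMs and offers valuable insights for developing future models that better meet the demands of temporally oriented tasks.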

Temporal Blind Spots in Large Language Models

Question Answering (QA) systems require a large amount of annotated data,
which is costly and time-consuming to gather. Converting datasets from existing
QA benchmarks is challenging because of their different formats and
complexities. To address these issues, we propose an algorithm that
automatically generates shorter questions, resembling the day-to-day human
communication found in the Natural Questions (NQ) dataset, from longer trivia
questions in the Quizbowl (QB) dataset by leveraging style conversion between
the datasets. This provides an automated way to generate more data for our QA
systems. To ensure quality as well as
quantity of data, we detect and remove ill-formed questions using a neural
classifier. We demonstrate that, in a low-resource setting, using the generated
data improves the QA performance over the baseline system on both NQ and QB
data. Our algorithm improves the scalability of training data while maintaining
quality of data for QA systems.
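
The filtering step, in which ill-formed questions are detected and removed, can be sketched as a threshold on a well-formedness score. This is an illustrative sketch only: the abstract does not specify the classifier, so `toy_score` below is a hypothetical stand-in heuristic (not a neural model), and the function names and the threshold value are assumptions.

```python
# Hypothetical set of question words used by the stand-in scorer below.
WH_WORDS = {"who", "what", "when", "where", "which", "why", "how"}


def toy_score(question):
    """Stand-in for a neural well-formedness classifier (assumption).

    Gives 0.5 for ending with a question mark and 0.5 for starting
    with a question word, so scores lie in {0.0, 0.5, 1.0}.
    """
    words = question.split()
    ends_ok = question.strip().endswith("?")
    starts_ok = bool(words) and words[0].lower() in WH_WORDS
    return 0.5 * ends_ok + 0.5 * starts_ok


def filter_well_formed(questions, score_fn, threshold=0.5):
    """Keep only the questions whose score is at least `threshold`."""
    return [q for q in questions if score_fn(q) >= threshold]
```

In practice `score_fn` would be replaced by the probability output of a trained classifier; the threshold then trades off the quantity of generated data against its quality.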

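This study proposes an algorithm that uses dataset conversion to automatically generate questions in the style of the Natural Questions (NQ) dataset, turning long trivia questions into shorter questions that resemble day-to-day human communication. A neural classifier detects and removes ill-formed questions, yielding high-quality data that improves QA performance. In a low-resource setting, the algorithm scales up the training data for QA systems while maintaining its quality.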

Improving Question Answering with Generation of NQ-like Questions

Synthesizing QA pairs with a question generator (QG) on the target domain has
become a popular approach for domain adaptation of question answering (QA)
models. Since synthetic questions are often noisy in practice, existing work
adapts scores from a pretrained QA (or QG) model as criteria to select
high-quality questions. However, these scores do not directly serve the
ultimate goal of improving QA performance on the target domain. In this paper,
we introduce a novel idea of training a question value estimator (QVE) that
directly estimates the usefulness of synthetic questions for improving the
target-domain QA performance. By conducting comprehensive experiments, we show
that the synthetic questions selected by QVE can help achieve better
target-domain QA performance, in comparison with existing techniques. We
additionally show that by using such questions and only around 15% of the human
annotations on the target domain, we can achieve comparable performance to the
fully-supervised baselines.
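
The selection step can be illustrated by ranking synthetic QA pairs by an estimated value and keeping the highest-ranked fraction. This is a minimal sketch under stated assumptions: how the QVE is trained is not described here, so `value_fn` is a hypothetical placeholder for it, and the `keep_ratio` parameter and the top-fraction selection rule are illustrative choices rather than the paper's procedure.

```python
def select_top_valued(examples, value_fn, keep_ratio=0.5):
    """Return the highest-valued fraction of `examples`.

    `value_fn` maps an example to a number (a placeholder for a trained
    question value estimator); `keep_ratio` must lie in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not examples:
        return []
    ranked = sorted(examples, key=value_fn, reverse=True)
    # Always keep at least one example when the input is non-empty.
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]
```

The selected pairs would then be combined with whatever human annotations are available on the target domain to train the QA model.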

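This paper proposes a novel question value estimator (QVE) that directly estimates the usefulness of synthetic questions for improving question answering (QA) performance on the target domain. Comprehensive experiments show that the synthetic questions selected by QVE help achieve better target-domain QA performance than existing techniques, and that using these questions together with only around 15% of the human annotations on the target domain achieves performance comparable to the fully supervised baselines.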