In this paper, we investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift, i.e., can LLMs generalize well, in a zero-shot fashion and without additional in-domain training, to closed domains that require specific knowledge, such as medicine and law? To this end, we devise a series of experiments to empirically explain the performance gap. Our findings suggest that: a) LLMs struggle with the dataset demands of closed domains, such as retrieving long answer spans; b) certain LLMs, despite showing strong overall performance, display weaknesses in meeting basic requirements such as discriminating between domain-specific senses of words, which we link to pre-processing decisions; c) scaling model parameters is not always effective for cross-domain generalization; and d) closed-domain datasets are quantitatively very different from open-domain EQA datasets, and current LLMs struggle to deal with them. Our findings point out important directions for improving existing LLMs.

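This study examines the zero-shot extractive question answering ability of Large Language Models (LLMs) in specialized domains such as medicine and law, aiming to address the insufficient generalization of language models in closed domains. Through a series of experiments, we find that LLMs perform poorly when handling the specific demands of closed domains, particularly in retrieving long answers and in discriminating between domain-specific word senses. These results reveal the challenges that existing LLMs face on closed-domain datasets and point to directions for their improvement.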

Exploring Language Model Generalization in Low-Resource Extractive Question Answering