We measure the performance of in-context learning as a function of task novelty and difficulty for open and closed questions. For that purpose, we created a novel benchmark consisting of hard scientific questions, each paired with contexts of varying relevance. We show that, counter-intuitively, a context that is more aligned with the topic does not always help more than a less relevant context. This effect is especially visible for open questions and for questions of high difficulty or novelty. This result reveals a fundamental difference in how large language models treat closed-form and open-form questions and shows the need for a more robust evaluation of in-context learning across a variety of question types. It also poses a new question of how to optimally select a context for large language models, especially in Retrieval-Augmented Generation (RAG) systems. Our results suggest that the answer to this question can be highly application-dependent and might be contingent on factors including the format of the question, the perceived difficulty of the question, and the novelty or popularity of the information we seek.

Why does in-context learning sometimes fail? Evaluating in-context learning on open and closed questions