Large Language Models (LLMs) are trained on vast amounts of data, most of
which is automatically scraped from the internet. This data includes
encyclopedic documents that harbor a vast amount of general knowledge (e.g.,
Wikipedia) but also potentially overlap with benchmark datasets used for
evaluating LLMs. Consequently, evaluating models on test splits that might have
leaked into the training set can lead to misleading conclusions. To foster
sound evaluation of language models, we introduce a new test dataset named
RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a
collection of five test splits, four of which have not been released to
the internet or exposed to LLM APIs prior to this publication. Each sample in
RepLiQA comprises (1) a reference document crafted by a human annotator and
depicting an imaginary scenario (e.g., a news article) absent from the
internet; (2) a question about the document's topic; (3) a ground-truth answer
derived directly from the information in the document; and (4) the paragraph
extracted from the reference document containing the answer. As such, accurate
answers can only be generated if a model can find relevant content within the
provided document. We run a large-scale benchmark comprising several
state-of-the-art LLMs to uncover differences in performance across models of
various types and sizes in a context-conditional language modeling setting.
Released splits of RepLiQA can be found here:
this https URL
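
The four-part sample described above can be pictured as a small record type. The sketch below is only an illustration: the field names, the prompt wording, and the example text are assumptions made here, not the released dataset's schema or the authors' evaluation prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepLiQASample:
    # Field names are hypothetical; they mirror items (1)-(4) of the abstract.
    document: str              # (1) human-written reference document
    question: str              # (2) question about the document's topic
    answer: str                # (3) ground-truth answer derived from the document
    supporting_paragraph: str  # (4) paragraph of the document containing the answer


def build_prompt(sample: RepLiQASample) -> str:
    """Build a context-conditional prompt: the model sees the document and the question."""
    return (
        "Answer the question using only the document below.\n\n"
        f"Document:\n{sample.document}\n\n"
        f"Question: {sample.question}\nAnswer:"
    )


def paragraph_is_grounded(sample: RepLiQASample) -> bool:
    """Check that the supporting paragraph is a verbatim extract of the document."""
    return sample.supporting_paragraph in sample.document


# Invented example text, used only to exercise the sketch.
sample = RepLiQASample(
    document=(
        "The town of Eldoria opened its first library on Monday.\n\n"
        "Mayor Ines Kato cut the ribbon."
    ),
    question="Who cut the ribbon at the library opening?",
    answer="Mayor Ines Kato",
    supporting_paragraph="Mayor Ines Kato cut the ribbon.",
)
prompt = build_prompt(sample)
```

Because the document is absent from the internet, a model can only answer from the text placed in the prompt, which is what the grounding check is meant to make explicit.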

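By introducing a new test dataset named RepLiQA, this study seeks to address problems that can arise when internet data is used to evaluate large language models, and it benchmarks models of various types and sizes to reveal differences in their performance in a context-conditional setting.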

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Large language models (LLMs) have garnered significant attention, but the
definition of "large" lacks clarity. This paper focuses on medium-sized
language models (MLMs), defined as having at least six billion parameters but
fewer than 100 billion. The study evaluates MLMs on zero-shot generative
question answering, which requires models to provide elaborate answers without
external document retrieval. The paper introduces its own test dataset and
presents results from human evaluation. Results show that combining the best
answers from different MLMs yielded an overall correct answer rate of 82.7%,
which is better than the 60.9% of ChatGPT. The best MLM achieved 46.4% and has
7B parameters, which highlights the importance of using appropriate training
data for fine-tuning rather than solely relying on the number of parameters.
More fine-grained feedback should be used to further improve the quality of
answers.
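
One way to read the "combining the best answers" figure is as an oracle rate: a question counts as solved if at least one model answered it correctly. The sketch below assumes that reading, which the abstract does not spell out. The function names and the toy judgments are hypothetical and are not the paper's data.

```python
from typing import Dict, List


def oracle_correct_rate(judgments: Dict[str, List[bool]]) -> float:
    """Fraction of questions answered correctly by at least one model."""
    per_model = list(judgments.values())
    n_questions = len(per_model[0])
    if any(len(marks) != n_questions for marks in per_model):
        raise ValueError("every model needs one judgment per question")
    # zip(*per_model) yields, per question, the tuple of all models' judgments.
    solved = [any(marks) for marks in zip(*per_model)]
    return sum(solved) / n_questions


def best_single_rate(judgments: Dict[str, List[bool]]) -> float:
    """Correct-answer rate of the strongest individual model."""
    return max(sum(marks) / len(marks) for marks in judgments.values())


# Toy human judgments (True = answer rated correct); made up for illustration.
toy = {
    "model_a": [True, False, False, True],
    "model_b": [False, True, False, True],
    "model_c": [False, False, False, True],
}
combined = oracle_correct_rate(toy)
best = best_single_rate(toy)
```

The gap between the two numbers shows how much the models' correct answers complement each other rather than overlap.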

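This paper studies the performance of medium-sized language models on zero-shot generative question answering. The evaluation shows that the best model reaches a correct answer rate of 46.4%, and that fine-tuning on appropriate training data matters more than relying on parameter count alone.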

Evaluation of medium-large Language Models at zero-shot closed book generative question answering

In real-world classification problems, the class balance in the training
dataset does not necessarily reflect that of the test dataset, which can cause
significant estimation bias. If the class ratio of the test dataset is known,
instance re-weighting or resampling allows systematic bias correction.
However, learning the class ratio of the test dataset is challenging when no
labeled data is available from the test domain. In this paper, we propose to
estimate the class ratio in the test dataset by matching probability
distributions of training and test input data. We demonstrate the utility of
the proposed approach through experiments.
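
To make the idea of distribution matching concrete, here is a minimal sketch for the binary case. It models the test inputs as a mixture of the two class-conditional training distributions and picks the mixing weight that brings the mixture closest to the test sample. As a stand-in for whatever divergence the paper uses, it matches Gaussian-kernel mean embeddings, which has a closed-form solution. The function names, the bandwidth, and the synthetic data are all assumptions made here.

```python
import math
import random
from typing import Sequence, Tuple

Point = Sequence[float]


def mean_kernel(a: Sequence[Point], b: Sequence[Point], bandwidth: float) -> float:
    """Average Gaussian kernel value over all pairs (x in a, y in b)."""
    total = 0.0
    for x in a:
        for y in b:
            sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(a) * len(b))


def estimate_class_ratio(x_pos, x_neg, x_test, bandwidth: float = 1.0) -> float:
    """Estimate the positive-class ratio pi of the unlabeled test set.

    Minimizes || pi*mu_pos + (1 - pi)*mu_neg - mu_test ||^2 over pi, where mu_*
    are kernel mean embeddings; the minimizer is then clipped to [0, 1].
    """
    k11 = mean_kernel(x_pos, x_pos, bandwidth)
    k00 = mean_kernel(x_neg, x_neg, bandwidth)
    k10 = mean_kernel(x_pos, x_neg, bandwidth)
    k1t = mean_kernel(x_pos, x_test, bandwidth)
    k0t = mean_kernel(x_neg, x_test, bandwidth)
    denom = k11 - 2.0 * k10 + k00  # ||mu_pos - mu_neg||^2
    if denom <= 0.0:
        raise ValueError("class-conditional samples are indistinguishable")
    pi = (k1t - k0t + k00 - k10) / denom
    return min(1.0, max(0.0, pi))


def class_weights(pi_test: float, pi_train: float) -> Tuple[float, float]:
    """Instance weights (positive, negative) that re-weight training data to the test ratio."""
    return pi_test / pi_train, (1.0 - pi_test) / (1.0 - pi_train)


# Synthetic 1-D data: balanced training set, test set with 30% positives.
rng = random.Random(0)
x_pos = [[rng.gauss(2.0, 1.0)] for _ in range(200)]
x_neg = [[rng.gauss(-2.0, 1.0)] for _ in range(200)]
x_test = ([[rng.gauss(2.0, 1.0)] for _ in range(90)]
          + [[rng.gauss(-2.0, 1.0)] for _ in range(210)])
pi_hat = estimate_class_ratio(x_pos, x_neg, x_test)
w_pos, w_neg = class_weights(pi_hat, pi_train=0.5)
print(f"estimated positive-class ratio: {pi_hat:.2f}")
```

Only the test inputs are used, never test labels, which is the setting the abstract describes. The estimated ratio then feeds the instance re-weighting step mentioned earlier in the abstract.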

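This paper proposes estimating the class ratio of the test dataset by matching the probability distributions of training and test input data, which addresses the difficulty of learning that ratio when no labeled data is available from the test domain.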