Test collections play a vital role in the evaluation of information retrieval
(IR) systems. Obtaining a diverse set of user queries for test collection
construction can be challenging, and acquiring relevance judgments, which
indicate the appropriateness of retrieved documents to a query, is often costly
and resource-intensive. Generating synthetic datasets using Large Language
Models (LLMs) has recently gained significant attention in various
applications. In IR, while previous work exploited the capabilities of LLMs to
generate synthetic queries or documents to augment training data and improve
the performance of ranking models, using LLMs for constructing synthetic test
collections is relatively unexplored. Previous studies demonstrate that LLMs
have the potential to generate synthetic relevance judgments for use in the
evaluation of IR systems. In this paper, we comprehensively investigate whether
it is possible to use LLMs to construct fully synthetic test collections by
generating not only synthetic judgments but also synthetic queries. In
particular, we analyse whether it is possible to construct reliable synthetic
test collections and the potential risks of bias such test collections may
exhibit towards LLM-based models. Our experiments indicate that it is possible
to use LLMs to construct synthetic test collections that can be used reliably
for retrieval evaluation.

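The feasibility of using large language models to construct fully synthetic test collections for evaluating information retrieval systems, and the potential risks of bias involved.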

Synthetic Test Collections for Retrieval Evaluation

Incomplete relevance judgments limit the re-usability of test collections.
When new systems are compared against the previous systems used to build the
pool of judged documents, they are often at a disadvantage due to the "holes"
in the test collection (i.e., pockets of unassessed documents returned by the
new system). In this paper, we take initial steps towards extending existing
test collections by employing Large Language Models (LLMs) to fill the holes,
leveraging and grounding the method using existing human judgments. We explore
this problem in the context of Conversational Search using TREC iKAT, where
information needs are highly dynamic and the responses (and the results
retrieved) are much more varied (leaving bigger holes). While previous work has
shown that automatic judgments from LLMs result in highly correlated rankings,
we find substantially lower correlations when human plus automatic judgments
are used (regardless of the LLM, of one-, two-, or few-shot prompting, or of
fine-tuning). We further find
that, depending on the LLM employed, new runs will be highly favored (or
penalized), and this effect is magnified proportionally to the size of the
holes. Instead, one should generate the LLM annotations on the whole document
pool to achieve more consistent rankings with human-generated labels. Future
work is required on prompt engineering and fine-tuning of LLMs to reflect and
represent the human annotations, in order to ground and align the models, such
that they are more fit for purpose.
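
As a rough illustration of the two quantities discussed above, the following
Python sketch computes the share of unjudged documents ("holes") in the top k of
a run and Kendall's tau between two system orderings. The document identifiers,
system names, and scores are made-up toy values, and tau-a is an assumed choice
of correlation measure; the abstract does not say which measure was used.

```python
from itertools import combinations


def hole_rate(ranked_docs, judged_docs, k=10):
    """Fraction of the top-k retrieved documents that have no relevance judgment."""
    top_k = ranked_docs[:k]
    if not top_k:
        return 0.0
    unjudged = [doc for doc in top_k if doc not in judged_docs]
    return len(unjudged) / len(top_k)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by the same system names.

    Tied pairs count as neither concordant nor discordant.
    """
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Toy data (hypothetical): documents judged by humans, and a new run
# that retrieves two documents outside the judged pool.
judged = {"d1", "d2", "d3"}
new_run = ["d1", "d7", "d2", "d9"]
rate = hole_rate(new_run, judged, k=4)

# Hypothetical effectiveness scores under human-only labels and under
# human labels with the holes filled automatically.
human_scores = {"sysA": 0.40, "sysB": 0.35, "sysC": 0.20}
mixed_scores = {"sysA": 0.38, "sysB": 0.41, "sysC": 0.22}
tau = kendall_tau(human_scores, mixed_scores)

print(f"hole rate: {rate:.2f}, Kendall's tau: {tau:.3f}")
```

In this toy case half of the new run's top four documents are unjudged, and
swapping one pair of systems out of three lowers tau from 1 to about 0.33.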

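Work on using large language models to fill the holes in test collections in order to extend existing collections, and on identifying the consistency gaps between human and automatic annotations so that the models better meet human needs.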

Can We Use Large Language Models to Fill Relevance Judgment Holes?

HC4 is a new suite of test collections for ad hoc Cross-Language Information
Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and
Russian, topics in English and in the document languages, and graded relevance
judgments. New test collections are needed because existing CLIR test
collections built using pooling of traditional CLIR runs have systematic gaps
in their relevance judgments when used to evaluate neural CLIR methods. The HC4
collections contain 60 topics and about half a million documents for each of
Chinese and Persian, and 54 topics and five million documents for Russian.
Active learning was used to determine which documents to annotate after being
seeded using interactive search and judgment. Documents were judged on a
three-grade relevance scale. This paper describes the design and construction
of the new test collections and provides baseline results for demonstrating
their utility for evaluating systems.

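This paper introduces HC4, a new suite of test collections for cross-language information retrieval, built using interactive search and judgment together with active learning, for evaluating neural CLIR methods, and provides baseline results.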

HC4: A New Suite of Test Collections for Ad Hoc CLIR

The TREC Deep Learning (DL) Track studies ad hoc search in the large data
regime, meaning that a large set of human-labeled training data is available.
Results so far indicate that the best models with large data may be deep neural
networks. This paper supports the reuse of the TREC DL test collections in
three ways. First, we describe the data sets in detail, documenting clearly and
in one place some details that are otherwise scattered in track guidelines,
overview papers and in our associated MS MARCO leaderboard pages. We intend
this description to make it easy for newcomers to use the TREC DL data. Second,
because there is some risk of iteration and selection bias when reusing a data
set, we describe the best practices for writing a paper using TREC DL data,
without overfitting. We provide some illustrative analysis. Finally, we address
a number of issues around the TREC DL data, including an analysis of
reusability.

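To support reuse of the TREC Deep Learning data, this paper describes the data sets in detail, sets out best practices for writing papers with TREC DL data, and analyses the reusability of the TREC DL data.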

TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime

Evaluation is crucial in Information Retrieval. The development of models,
tools and methods has significantly benefited from the availability of reusable
test collections formed through a standardized and thoroughly tested
methodology, known as the Cranfield paradigm. Constructing these collections
requires obtaining relevance judgments for a pool of documents retrieved by
systems participating in an evaluation task, and thus involves immense human
labor. To alleviate this effort, different methods for constructing collections
have been proposed in the literature, falling under two broad categories: (a)
sampling, and (b) active selection of documents. The former devises a smart
sampling strategy by choosing only a subset of documents to be assessed and
inferring the evaluation measure on the basis of the obtained sample; the
sampling distribution is fixed at the beginning of the process. The latter
recognizes that systems contributing documents to be judged vary in quality,
and actively selects documents from good systems. The quality of systems is
measured every time a new document is judged. In this paper we seek to
solve the problem of large-scale retrieval evaluation combining the two
approaches. We devise an active sampling method that avoids the bias of the
active selection methods towards good systems, and at the same time reduces the
variance of the current sampling approaches by placing a distribution over
systems, which varies as judgments become available. We validate the proposed
method using TREC data and demonstrate the advantages of this new method
compared to past approaches.
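
To make the sampling category concrete, here is a minimal Python sketch of
inferring a quantity from a judged sample when the sampling distribution is
fixed up front. It uses Poisson sampling with a Horvitz-Thompson estimator,
which is an assumed textbook choice for illustration and not the active
sampling method proposed in the paper. The pool, inclusion probabilities, and
relevance labels are hypothetical toy values.

```python
import random


def poisson_sample(inclusion_prob, rng):
    """Include each document independently with its own fixed probability."""
    return [doc for doc, p in inclusion_prob.items() if rng.random() < p]


def estimate_relevant_count(sampled_docs, judgments, inclusion_prob):
    """Horvitz-Thompson estimate of the number of relevant documents in the pool.

    Each judged document is weighted by the inverse of its inclusion
    probability, so rarely sampled documents stand in for more of the pool.
    """
    return sum(judgments[doc] / inclusion_prob[doc] for doc in sampled_docs)


# Hypothetical pool: higher inclusion probability for documents assumed to be
# ranked highly by many systems. In practice only sampled documents would be
# judged; the full label dict exists here only to check the estimator.
inclusion_prob = {"d1": 0.9, "d2": 0.6, "d3": 0.5, "d4": 0.3, "d5": 0.2, "d6": 0.2}
true_judgments = {"d1": 1, "d2": 1, "d3": 0, "d4": 1, "d5": 0, "d6": 0}

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 20000
estimates = [
    estimate_relevant_count(
        poisson_sample(inclusion_prob, rng), true_judgments, inclusion_prob
    )
    for _ in range(trials)
]
mean_estimate = sum(estimates) / trials

print(f"true relevant count: {sum(true_judgments.values())}")
print(f"mean estimate over {trials} samples: {mean_estimate:.2f}")
```

Individual estimates vary widely from sample to sample, but their average
approaches the true count of three relevant documents. That variance is the
cost of a fixed sampling distribution that the abstract says the proposed
method aims to reduce.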

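This paper proposes a large-scale information retrieval evaluation method that combines two approaches, sampling and active selection of documents; it reduces sampling bias by placing a distribution over systems that is updated during the evaluation process, and validates its advantages using TREC data.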