Deep clustering, a method for partitioning complex, high-dimensional data
using deep neural networks, presents unique evaluation challenges. Traditional
clustering validation measures, designed for low-dimensional spaces, are
problematic for deep clustering, which involves projecting data into
lower-dimensional embeddings before partitioning. Two key issues are
identified: 1) the curse of dimensionality when applying these measures to raw
data, and 2) the unreliable comparison of clustering results across different
embedding spaces stemming from variations in training procedures and parameter
settings in different clustering models. This paper addresses these challenges
in evaluating clustering quality in deep learning. We present a theoretical
framework that highlights the ineffectiveness of applying internal validation
measures to raw and embedded data, and we propose a systematic approach to applying
clustering validity indices in deep clustering contexts. Experiments show that
this framework aligns better with external validation measures, effectively
reducing the misguidance from the improper use of clustering validity indices
in deep learning.
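
The second issue named in the abstract, that internal validation scores are hard to compare across embedding spaces, can be illustrated with a small sketch. It is not the paper's framework: the `silhouette` helper, the choice of the silhouette coefficient as the internal measure, and the two toy "embeddings" are all assumptions made purely for illustration.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (assumes >= 2 clusters)."""
    n = len(points)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        # Mean distance from point i to the members of each cluster, excluding i itself.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                total = sum(math.dist(points[i], points[j]) for j in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        if own not in mean_dist:
            # Singleton cluster: the silhouette is conventionally 0.
            scores.append(0.0)
            continue
        a = mean_dist[own]                                      # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # separation
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / n

# Hypothetical data: the SAME partition of four samples, represented in two
# different embedding spaces (e.g. produced by two different models).
labels = [0, 0, 1, 1]
embedding_a = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
embedding_b = [(0.0, 0.0), (0.0, 1.0), (1.5, 0.0), (1.5, 1.0)]

score_a = silhouette(embedding_a, labels)
score_b = silhouette(embedding_b, labels)
print(f"embedding A: {score_a:.3f}, embedding B: {score_b:.3f}")
```

The partition is identical in both cases, so any external measure computed against ground-truth labels would give the same value for both, yet the internal score differs only because the geometry of the space differs. That is the kind of misleading comparison the abstract warns about.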

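Deep clustering, which partitions complex, high-dimensional data using deep neural networks, poses unique evaluation challenges, and traditional clustering validation measures, being suited to low-dimensional spaces, are problematic in this setting. This paper studies the problem of evaluating clustering quality in deep learning. It presents a theoretical framework that highlights the ineffectiveness of applying internal validation measures to raw and embedded data, and proposes a systematic approach to applying clustering validity indices in deep clustering. Experiments show that the framework agrees more closely with external validation measures and effectively reduces the misguidance caused by improper use of clustering validity indices in deep learning.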

Deep Clustering Evaluation: How to Validate Internal Clustering Validation Metrics

Deep Clustering Evaluation: How to Validate Internal Clustering Validation Measures

Due to their expanding capabilities and pre-training data, Large Language
Models (LLMs) are facing increasingly serious evaluation challenges. On the one
hand, data leakage causes overestimation on existing benchmarks. On
the other hand, periodically curating datasets manually is costly. In this
paper, we propose to automate dataset updates for reliable and timely
evaluation. The basic idea is to generate unseen and high-quality testing
samples based on existing ones to mitigate leakage issues. Specifically, we
propose two strategies with systematic verification. First, the mimicking
strategy employs LLMs to create new samples resembling existing ones,
preserving the style of the original dataset as far as possible. Our
experiments demonstrate its evaluation stability across multiple instantiations
and its effectiveness in dealing with data leakage issues in most cases.
Second, for cases where the mimicking strategy works poorly, we design an
extending strategy that adjusts the difficulty of the generated samples
according to varying cognitive levels. This not only makes our evaluation more
systematic but also, with balanced difficulty, discerns model
capabilities better at fine-grained levels.

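This paper addresses the evaluation challenges and data leakage issues facing large language models by automating dataset updates for reliable and timely evaluation.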

Automating Dataset Updates for Reliable and Timely Evaluation

Have Seen Me Before? Automating Dataset Updates Towards Reliable and Timely Evaluation

Event extraction has attracted much attention in recent years due to its
potential for many applications. However, recent studies observe some
evaluation challenges, suggesting that reported scores might not reflect the
true performance. In this work, we first identify and discuss these evaluation
challenges, including the unfair comparisons resulting from different
assumptions about data or different data preprocessing steps, the
incompleteness of the current evaluation framework leading to potential dataset
bias or data split bias, and low reproducibility of prior studies. To address
these challenges, we propose TextEE, a standardized, fair, and reproducible
benchmark for event extraction. TextEE contains standardized data preprocessing
scripts and splits for more than ten datasets across different domains. In
addition, we aggregate and re-implement over ten event extraction approaches
published in recent years and conduct a comprehensive reevaluation. Finally, we
explore the capability of large language models in event extraction and discuss
some future challenges. We expect TextEE will serve as a reliable benchmark for
event extraction, facilitating future research in the field.

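This paper discusses and addresses the challenges in evaluating event extraction and proposes TextEE, a standardized, fair, and reproducible benchmark for event extraction. TextEE includes standardized data preprocessing scripts and dataset splits across multiple domains. The paper also reevaluates multiple event extraction approaches and explores the capability of large language models in event extraction along with future challenges.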

A Reevaluation of Event Extraction: Past, Present, and Future Challenges

A Reevaluation of Event Extraction: Past, Present, and Future Challenges

The task of long-form question answering (LFQA) involves retrieving documents
relevant to a given question and using them to generate a paragraph-length
answer. While many models have recently been proposed for LFQA, we show in this
paper that the task formulation raises fundamental challenges regarding
evaluation and dataset creation that currently preclude meaningful modeling
progress. To demonstrate these challenges, we first design a new system that
relies on sparse attention and contrastive retriever learning to achieve
state-of-the-art performance on the ELI5 LFQA dataset. While our system tops
the public leaderboard, a detailed analysis reveals several troubling trends:
(1) our system's generated answers are not actually grounded in the documents
that it retrieves; (2) ELI5 contains significant train / validation overlap, as
at least 81% of ELI5 validation questions occur in paraphrased form in the
training set; (3) ROUGE-L is not an informative metric of generated answer
quality and can be easily gamed; and (4) human evaluations used for other text
generation tasks are unreliable for LFQA. We offer suggestions to mitigate each
of these issues, which we hope will lead to more rigorous LFQA research and
meaningful progress in the future.
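
Point (3) above, that ROUGE-L can be uninformative about answer quality, can be made concrete with a minimal sketch. This is not the paper's evaluation code: the helper names, the whitespace tokenization, the plain F1 weighting, and the three example sentences are all assumptions chosen for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 over LCS precision and recall (whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Invented example sentences, not taken from ELI5.
reference = "the sky is blue because air scatters blue light more than red light"
# Copies the reference's wording but states something different.
lexical_copy = "the sky is green because air absorbs blue light more than red light"
# Conveys a similar idea with different wording.
paraphrase = "sunlight is scattered by the atmosphere and short blue wavelengths scatter the most"

copy_score = rouge_l_f1(lexical_copy, reference)
paraphrase_score = rouge_l_f1(paraphrase, reference)
print(f"lexical copy: {copy_score:.3f}, paraphrase: {paraphrase_score:.3f}")
```

Because the score depends only on shared token order, an answer that reuses the reference's phrasing while changing its meaning outscores a reasonable paraphrase by a wide margin in this toy case. Official ROUGE implementations differ in tokenization, stemming, and F-measure weighting, so absolute values from this sketch should not be compared with published numbers.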

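This paper examines the challenges that evaluation and dataset construction pose for long-form question answering. While proposing a new model, it shows that ROUGE-L is not an informative metric for this task and that the training and validation sets overlap significantly. It offers suggestions for mitigating these issues.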