Machine learning (ML) technologies have become pervasive in practically all
aspects of our society, and data quality (DQ) is critical for the performance,
fairness, robustness, safety, and scalability of ML models. With the large and
complex data in data-centric AI, traditional methods like exploratory data
analysis (EDA) and cross-validation (CV) face challenges, highlighting the
importance of mastering DQ tools. In this survey, we review 17 DQ evaluation
and improvement tools from the last 5 years. By introducing the DQ dimensions,
metrics, and main functions embedded in these tools, we compare their strengths
and limitations and propose a roadmap for developing open-source DQ tools for
ML. Based on the discussions on the challenges and emerging trends, we further
highlight the potential applications of large language models (LLMs) and
generative AI in DQ evaluation and improvement for ML. We believe this
comprehensive survey can enhance understanding of DQ in ML and help drive
progress in data-centric AI. A complete list of the literature investigated in
this survey is available on GitHub at:
this https URL
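
As a rough illustration of what a DQ metric computes, the sketch below scores a small tabular dataset on two commonly used dimensions, completeness and uniqueness. The abstract does not say which dimensions or metrics the surveyed tools implement, so the choice of these two, the function names `completeness` and `duplicate_rate`, and the toy rows are all assumptions made for illustration.

```python
def completeness(rows: list[dict], columns: list[str]) -> float:
    """Fraction of cells in the given columns that are present and not None."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0  # an empty table has no missing cells
    filled = sum(1 for row in rows for col in columns if row.get(col) is not None)
    return filled / total


def duplicate_rate(rows: list[dict], columns: list[str]) -> float:
    """Fraction of rows that repeat an earlier row on the given columns."""
    if not rows:
        return 0.0
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(rows)


# Hypothetical toy dataset: one missing value and one repeated row.
rows = [
    {"age": 30, "label": "a"},
    {"age": None, "label": "b"},
    {"age": 30, "label": "a"},
]
print(completeness(rows, ["age", "label"]))
print(duplicate_rate(rows, ["age", "label"]))
```

Scores like these are only a starting point; the other aspects named in the abstract, such as fairness and robustness, need task-specific checks beyond simple cell counts.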

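Reviews and compares data quality evaluation tools for machine learning, proposes a roadmap for developing open-source data quality tools, and discusses potential applications of large language models and generative AI in data quality evaluation and improvement.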

A Survey on Data Quality Dimensions and Tools for Machine Learning

The instruction-following ability of Large Language Models (LLMs) has
given rise to a class of LLM-based systems capable of tackling complex tasks
such as making edits to large code repositories. Due to the high sensitivity
and unpredictability of LLM behavior in response to changes in prompting,
robust evaluation tools are needed to drive future iteration of these systems.
We propose RES-Q, a natural language instruction-based benchmark for evaluating
$\textbf{R}$epository $\textbf{E}$diting $\textbf{S}$ystems, which consists of
100 repository editing tasks derived from real GitHub commits. Given an edit
instruction and a code repository, RES-Q evaluates an LLM system's ability to
gather information and construct an edit that satisfies the criteria set by the
instruction. We argue that evaluating LLMs in this way addresses issues with
traditional benchmarks and provides a more holistic assessment of a model's
abilities. We evaluate various state-of-the-art LLMs as language agents in a
repository-editing system built on Qurrent OS, our language agent development
software. Despite their 1% pass@1 performance difference on HumanEval, we find
Claude Sonnet 3.5 outperforms GPT-4o by 12% pass@1 on RES-Q, indicating RES-Q's
capacity to differentiate model capability as traditional benchmarks approach
saturation. We further investigate token efficiency, performance relationships
with existing benchmarks, and interesting disparities between closed and
open-source LLMs. Code and dataset are available at
this https URL
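
For readers unfamiliar with the pass@1 figures quoted above, the sketch below shows the widely used unbiased pass@k estimator, of which pass@1 is the special case k = 1. Whether RES-Q computes its scores exactly this way is an assumption; the abstract only reports pass@1 values. The function names and the example outcomes are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task, given n samples of which c pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(outcomes: list[bool]) -> float:
    """With one attempt per task, pass@1 is the fraction of tasks solved."""
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for four tasks, one attempt each (not real RES-Q data).
print(benchmark_pass_at_1([True, False, True, True]))
# Single task with 10 samples, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))
```

With a single attempt per task, a benchmark-level pass@1 is simply the share of tasks whose edit passes, so on a 100-task benchmark each task contributes one percentage point.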

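Proposes RES-Q, a natural-language instruction-based benchmark, to evaluate the instruction-following ability of large language models in code repository editing systems; finds differences in model capability and argues for the need for robust evaluation tools.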

RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale

The versatility of large language models (LLMs) has led to the creation of
diverse benchmarks that thoroughly test a variety of language models'
abilities. These benchmarks consist of tens of thousands of examples, making
evaluation of LLMs very expensive. In this paper, we investigate strategies to
reduce the number of evaluations needed to assess the performance of an LLM on
several key benchmarks. For example, we show that to accurately estimate the
performance of an LLM on MMLU, a popular multiple-choice QA benchmark
consisting of 14K examples, it is sufficient to evaluate this LLM on 100
curated examples. We release evaluation tools and tiny versions of popular
benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical
analysis demonstrates that these tools and tiny benchmarks are sufficient to
reliably and efficiently reproduce the original evaluation results.
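
To make the cost trade-off concrete, the sketch below estimates benchmark accuracy from a small subset of examples. It uses plain uniform random sampling, which is a naive baseline and not the curated selection the abstract describes; the function name `estimate_accuracy`, the synthetic correctness vector, and the 70% accuracy figure are assumptions for illustration only.

```python
import random
from math import sqrt


def estimate_accuracy(correct: list[int], sample_size: int, seed: int = 0) -> tuple[float, float]:
    """Estimate accuracy from a uniform random subset of per-example 0/1 scores.

    Returns (estimate, standard_error) using the binomial approximation.
    """
    rng = random.Random(seed)
    subset = rng.sample(correct, sample_size)  # sampling without replacement
    p = sum(subset) / sample_size
    se = sqrt(p * (1 - p) / sample_size)
    return p, se


# Synthetic per-example results: 14,000 examples, 70% answered correctly.
full_results = [1] * 9800 + [0] * 4200
estimate, std_err = estimate_accuracy(full_results, sample_size=100)
print(f"estimate={estimate:.3f} standard_error={std_err:.3f}")
```

With accuracy near 0.7, the binomial standard error at 100 random examples is roughly 0.05, which is too coarse to separate closely ranked models. That gap is one way to see why a carefully chosen set of examples could be worth the curation effort.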

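Investigates strategies to reduce the number of evaluations needed to assess LLM performance on several key benchmarks, and releases evaluation tools and tiny benchmarks shown to reproduce the original evaluation results reliably and efficiently.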

tinyBenchmarks: evaluating LLMs with fewer examples

While LLMs can provide reasoned explanations along with their answers, the
nature and quality of those explanations are still poorly understood. In
response, our goal is to define a detailed way of characterizing the
explanation capabilities of modern models and to create a nuanced,
interpretable explanation evaluation tool that can generate such
characterizations automatically, without relying on expensive API calls or
human annotations. Our approach is to (a) define the new task of explanation
critiquing - identifying and categorizing any main flaw in an explanation and
providing suggestions to address the flaw, (b) create a sizeable,
human-verified dataset for this task, and (c) train an open-source, automatic
critiquing model (called Digital Socrates) using this data. Through
quantitative and qualitative analysis, we demonstrate how Digital Socrates is
useful for revealing insights about student models by examining their reasoning
chains, and how it can provide high-quality, nuanced, automatic evaluation of
those model explanations for the first time. Digital Socrates thus fills an
important gap in evaluation tools for understanding and improving the
explanation behavior of models.

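Defines the explanation-critiquing task, builds a human-verified dataset for it, and trains an open-source critique model, Digital Socrates, which automatically evaluates LLM explanations and, as shown through quantitative and qualitative analysis, fills an important gap in evaluation tools for model explanation behavior.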