Instruction tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality issues in gold-standard labels. So far, however, AED methods have been applied only in discriminative settings. It remains an open question how well they generalize to generative settings, which are becoming widespread through generative LLMs. In this work, we present Donkii, the first benchmark for AED on instruction-tuning data. It encompasses three instruction-tuning datasets enriched with annotations by experts and semi-automatic methods. We find that all three datasets contain clear-cut errors that sometimes propagate directly into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them comprehensively on the new benchmark. Our results show that choosing the right AED method and model size is crucial, and we derive practical recommendations from them. Finally, we provide a first case study examining how the quality of instruction-tuning datasets influences downstream performance.

Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?