Tracing and Removing Data Errors in Natural Language Generation Datasets

Dec, 2022

Tracing and Removing Data Errors in Natural Language Generation Datasets

Faisal Ladhak, Esin Durmus, Tatsunori Hashimoto

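TL;DR: This work proposes a framework that uses a contrast-based algorithm to identify and remove low-quality examples from the training data, with the aim of reducing hallucinations and unfaithful outputs in natural language generation tasks.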

Abstract

Recent work has identified noisy and misannotated data as a core cause of hallucinations and unfaithful outputs in natural language generation (NLG) tasks. Consequently, identifying and removing these examples is