Increasingly large datasets have become a standard ingredient in advancing the state of the art in NLP. However, data quality may already have become the bottleneck to unlocking further gains. Given the diversity and size of modern datasets, standard data filtering is not straightforward to apply, because harmful data is multifaceted and filtering rules that generalize across multiple tasks are elusive. We study the fitness of task-agnostic self-influence scores of training examples for data cleaning, analyze their efficacy in capturing naturally occurring outliers, and investigate to what extent self-influence-based data cleaning can improve downstream performance in machine translation, question answering, and text classification, building on recent approaches to self-influence calculation and automated curriculum learning.

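This paper studies how effective task-agnostic self-influence scores are for cleaning training data. It analyzes how well these scores capture naturally occurring outliers and investigates how much self-influence-based data cleaning improves tasks such as machine translation, question answering, and text classification, building on recent methods for self-influence calculation and automated curriculum learning.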

Make Every Example Count: On the Stability and Utility of Self-Influence for Learning from Noisy NLP Datasets