The conventional success of textual classification relies on annotated data,
and the new paradigm of pre-trained language models (PLMs) still requires some
labeled data for downstream tasks. However, in real-world applications, label
noise inevitably exists in training data, damaging the effectiveness,
robustness, and generalization of the models constructed on such data.
Recently, remarkable progress has been made in mitigating this problem for
visual data, while only a few studies explore textual data. To fill this gap, we
present SelfMix, a simple yet effective method, to handle label noise in text
classification tasks. SelfMix uses the Gaussian Mixture Model to separate
samples and leverages semi-supervised learning. Unlike previous works requiring
multiple models, our method utilizes the dropout mechanism on a single model to
reduce the confirmation bias in self-training and introduces a textual-level
mixup training strategy. Experimental results on three text classification
benchmarks with different types of text show that our proposed method
outperforms strong baselines designed for both textual and visual data under
different noise ratios and noise types. Our code is
available at this https URL.
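
The sample-separation step can be illustrated with a minimal sketch. It
assumes, as is common in noisy-label learning but not stated in the abstract,
that a two-component Gaussian mixture is fit on per-sample training losses and
that the low-mean component corresponds to clean samples. The function names,
the 0.5 threshold, and the toy loss values are hypothetical; this is not the
authors' implementation.

```python
import math


def _responsibilities(values, means, variances, weights):
    """Posterior probability of each of the two components for every value."""
    resp = []
    for v in values:
        dens = [
            weights[k]
            * math.exp(-((v - means[k]) ** 2) / (2 * variances[k]))
            / math.sqrt(2 * math.pi * variances[k])
            for k in range(2)
        ]
        total = sum(dens) or 1e-300  # guard against underflow to zero
        resp.append([d / total for d in dens])
    return resp


def fit_two_component_gmm(values, iters=50):
    """Fit a 1-D, two-component Gaussian mixture with plain EM."""
    lo, hi = min(values), max(values)
    means = [lo, hi]  # initialise the two means at the extremes
    spread = max((hi - lo) ** 2 / 4.0, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iters):
        resp = _responsibilities(values, means, variances, weights)  # E-step
        for k in range(2):  # M-step
            nk = sum(r[k] for r in resp) or 1e-12
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite
            weights[k] = nk / len(values)
    return means, variances, weights


def clean_probability(values, means, variances, weights):
    """Probability that each value belongs to the low-mean (assumed clean) component."""
    clean = 0 if means[0] <= means[1] else 1
    resp = _responsibilities(values, means, variances, weights)
    return [r[clean] for r in resp]


# Toy per-sample losses (hypothetical): 80 low-loss and 20 high-loss samples.
losses = [0.1 + 0.002 * i for i in range(80)] + [1.8 + 0.02 * i for i in range(20)]
params = fit_two_component_gmm(losses)
probs = clean_probability(losses, *params)

# Assumed threshold of 0.5: likely-clean samples keep their labels, the rest
# would be treated as unlabeled data for the semi-supervised stage.
clean_idx = [i for i, p in enumerate(probs) if p > 0.5]
noisy_idx = [i for i, p in enumerate(probs) if p <= 0.5]
```

In this sketch the samples in `noisy_idx` would have their labels discarded,
which is what makes the subsequent training semi-supervised.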

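This paper proposes SelfMix, a simple yet effective method for handling label noise in text classification tasks. The method uses a Gaussian Mixture Model to separate samples and leverages semi-supervised learning. Experimental results show that it outperforms strong baselines designed for textual and visual data under different noise types.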

SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup Training

Data augmentation, the artificial creation of training data for machine
learning by transformations, is a widely studied research field across machine
learning disciplines. While it is useful for increasing a model's
generalization capabilities, it can also address many other challenges and
problems, from overcoming a limited amount of training data, to regularizing
the objective, to limiting the amount of data used to protect privacy. Based on a
precise description of the goals and applications of data augmentation and a
taxonomy for existing works, this survey is concerned with data augmentation
methods for textual classification and aims to provide a concise and
comprehensive overview for researchers and practitioners. Guided by the
taxonomy, we divide more than 100 methods into 12 groups and give
state-of-the-art references that identify which methods are highly promising by
relating them to each other. Finally, we outline research perspectives that may
serve as building blocks for future work.
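
To make "creation of training data by transformations" concrete, here is a
minimal sketch of two simple token-level transformations, random swap and
random deletion, applied to a labeled sentence. The abstract does not name
specific methods, so the choice of transformations, the function names, and
the parameter values are assumptions made purely for illustration.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random position pairs exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(sentence, label, n_copies=3, seed=0):
    """Create n_copies transformed variants that all keep the original label."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    tokens = sentence.split()
    examples = []
    for _ in range(n_copies):
        new_tokens = random_deletion(random_swap(tokens, 1, rng), 0.1, rng)
        examples.append((" ".join(new_tokens), label))
    return examples


augmented = augment("the movie was surprisingly good", "positive")
for text, label in augmented:
    print(label, "|", text)
```

Both transformations assume that small perturbations of word order or content
leave the class label unchanged, which may not hold for every task.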

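This survey addresses data augmentation, the artificial creation of training data through transformations, which increases data diversity and improves the generalization of machine learning classifiers. Focusing on text classification, it gives a detailed overview and taxonomy of data augmentation methods and their goals, and closes with constructive directions for future work in the field.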