Self-training provides an effective means of using an extremely small amount of labeled data to create pseudo-labels for unlabeled data. Many state-of-the-art self-training approaches hinge on different regularization methods to prevent overfitting and improve generalization. Yet they still rely heavily on the predictions of a model initially trained on the limited labeled data as pseudo-labels, and depending on that first prediction they are likely to place overconfident label belief on erroneous classes. To tackle this issue in text classification, we introduce LST, a simple self-training method that uses a lexicon to guide the pseudo-labeling mechanism in a linguistically enriched manner. We iteratively refine the lexicon using the predicted confidence on unseen data, so that it teaches better pseudo-labels over the training iterations. We demonstrate that this simple yet well-crafted lexical knowledge outperforms the current state-of-the-art approaches by 1.0-2.0% on five benchmark datasets when using 30 labeled samples per class.
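
As a rough illustration of the idea described above, the following sketch blends a classifier's class probabilities with a lexicon-based belief before assigning a pseudo-label, then feeds confident pseudo-labels back into the lexicon. It is not the LST implementation: the function names, the linear mixing weight `alpha`, the confidence `threshold`, and the additive lexicon update are all assumptions made for illustration, and the classifier's probabilities are simply passed in as numbers.

```python
# Illustrative sketch only; names, alpha, threshold, and the update rule
# are hypothetical and are not taken from the LST paper.

def lexicon_probs(tokens, lexicon, labels):
    """Turn lexicon hits into a per-class distribution (uniform if no hits)."""
    scores = [sum(lexicon[c].get(t, 0.0) for t in tokens) for c in labels]
    total = sum(scores)
    if total == 0:
        return [1.0 / len(labels)] * len(labels)
    return [s / total for s in scores]


def pseudo_label(tokens, model_probs, lexicon, labels, alpha=0.5, threshold=0.7):
    """Blend model and lexicon beliefs; return (label, confidence) or None."""
    lex = lexicon_probs(tokens, lexicon, labels)
    mixed = [alpha * m + (1 - alpha) * l for m, l in zip(model_probs, lex)]
    best = max(range(len(labels)), key=mixed.__getitem__)
    if mixed[best] < threshold:
        return None  # not confident enough to use as a pseudo-label
    return labels[best], mixed[best]


def refine_lexicon(lexicon, tokens, label, confidence):
    """Add the pseudo-label's confidence to each distinct token's class weight."""
    for t in set(tokens):
        lexicon[label][t] = lexicon[label].get(t, 0.0) + confidence


labels = ["pos", "neg"]
lexicon = {"pos": {"great": 1.0}, "neg": {"awful": 1.0}}

# The lexicon agrees with a weakly confident model, so the sample is accepted.
accepted = pseudo_label(["a", "great", "movie"], [0.55, 0.45], lexicon, labels)
if accepted is not None:
    refine_lexicon(lexicon, ["a", "great", "movie"], *accepted)

# The lexicon contradicts an overconfident model, so the sample is skipped.
rejected = pseudo_label(["awful"], [0.9, 0.1], lexicon, labels)
```

In this toy setup the second sample is left unlabeled because the lexicon evidence pulls the blended confidence below the threshold, which mirrors the motivation of not trusting an overconfident first prediction on its own.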

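This paper introduces LST, a simple self-training method that uses a lexicon to guide the pseudo-labeling mechanism. Working in a linguistically enriched manner, we continually refine the lexicon by predicting the confidence of unseen data so that it teaches better pseudo-labels, achieving a 1.0-2.0% performance improvement on five benchmark datasets with 30 labeled samples per class.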

LST: Lexicon-Guided Self-Training for Few-Shot Text Classification