This paper presents a production Semi-Supervised Learning (SSL) pipeline based on the student-teacher framework, which leverages millions of unlabeled examples to improve performance on Natural Language Understanding (NLU) tasks. We investigate two questions related to the use of unlabeled data in a production SSL context: 1) how to select samples from a huge unlabeled data pool that are beneficial for SSL training, and 2) how the selected data affect the performance of different state-of-the-art SSL techniques. We compare four widely used SSL techniques, Pseudo-Label (PL), Knowledge Distillation (KD), Virtual Adversarial Training (VAT), and Cross-View Training (CVT), in conjunction with two data selection methods: committee-based selection and submodular-optimization-based selection. We further examine the benefits and drawbacks of these techniques when applied to intent classification (IC) and named entity recognition (NER) tasks, and provide guidelines specifying when each of these methods might be beneficial for improving large-scale NLU systems.

Industry Scale Semi-Supervised Learning for Natural Language Understanding