Recognizing software entities such as library names in free-form text is essential for many software engineering (SE) technologies, such as traceability link recovery, automated documentation, and API recommendation. While many approaches have been proposed to address this problem, they suffer from small entity vocabularies or noisy training data, which hinders their ability to recognize software entities mentioned in sophisticated narratives. To address this challenge, we leverage the Wikipedia taxonomy to develop a comprehensive entity lexicon with 79K unique software entities in 12 fine-grained types, as well as a large labeled dataset of over 1.7M sentences. Then, we propose self-regularization, a noise-robust learning approach that accounts for many dropouts, for training our software entity recognition (SER) model. Results show that models trained with self-regularization outperform both their vanilla counterparts and state-of-the-art approaches on our Wikipedia benchmark and two Stack Overflow benchmarks. We release our models, data, and code for future research.
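
The abstract does not spell out the self-regularization objective, so the following is only a minimal sketch under one assumption: that "accounting for many dropouts" means penalizing disagreement between predictions obtained from separate dropout-perturbed forward passes of the same model, in the spirit of consistency regularization. The toy linear classifier and every name here (`forward`, `self_regularized_loss`, `alpha`) are hypothetical and are not taken from the released code.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(vec, p, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vec]


def forward(features, weights, p, rng):
    """Toy token classifier (assumption): dropout, linear layer, softmax."""
    h = dropout(features, p, rng)
    logits = [sum(w * x for w, x in zip(row, h)) for row in weights]
    return softmax(logits)


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def self_regularized_loss(features, weights, label, p=0.1, alpha=1.0,
                          rng=None, eps=1e-12):
    """Cross-entropy averaged over two dropout passes plus a symmetric-KL
    consistency penalty weighted by alpha (hypothetical hyperparameter)."""
    if rng is None:
        rng = random.Random(0)
    p1 = forward(features, weights, p, rng)  # first dropout mask
    p2 = forward(features, weights, p, rng)  # second, independent mask
    ce = -0.5 * (math.log(p1[label] + eps) + math.log(p2[label] + eps))
    consistency = 0.5 * (kl(p1, p2) + kl(p2, p1))
    return ce + alpha * consistency


if __name__ == "__main__":
    weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0]]  # 2 classes, 3 features
    features = [1.0, 2.0, -1.0]
    print(self_regularized_loss(features, weights, label=1, p=0.5,
                                rng=random.Random(7)))
```

With the dropout rate set to 0 the two passes coincide, the consistency term vanishes, and the loss reduces to ordinary cross-entropy. The intuition behind this kind of penalty is that a noisy label is hard to fit consistently under different dropout masks, so it contributes less stable gradients than a clean one.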

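By leveraging the Wikipedia taxonomy to build a comprehensive entity lexicon of 79K software entities in 12 fine-grained types, together with a large labeled dataset of 1.7M sentences, we propose a self-regularized training method for software entity recognition (SER) models. The method overcomes noise in the corpus and insufficient training data, and it outperforms baseline models and existing approaches on the Wikipedia benchmark and two Stack Overflow benchmarks. We release our models, data, and code for future research.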

Software Entity Recognition with Noise-Robust Learning