Pretrained language models have achieved remarkable success across a wide range of natural language processing tasks. However, pretraining has recently shifted toward ever-larger models and datasets, incurring significant computational and energy costs. In this paper, we propose