Apr, 2024
Quantifying Memorization of Domain-Specific Pre-trained Language Models using Japanese Newspaper and Paywalls
Shotaro Ishihara
TL;DR
Domain-specific GPT-2 models were pre-trained on a limited corpus of Japanese newspaper articles. The study finds that these domain-specific pre-trained language models reproduce large spans of training text verbatim (copy-and-paste behavior) during generation, and that memorization correlates with factors such as duplication in the training data, model size, and prompt length.
Abstract
Dominant pre-trained language models (PLMs) have been successful in high-quality natural language generation. However, the analysis of their generation is not mature: do they acquire generalizable linguistic abstractions, or do they simply memorize and recover substrings of the training data?
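The memorization test the TL;DR describes can be illustrated with a short sketch: prompt the model with the opening tokens of a training article and check whether its greedy continuation reproduces the article verbatim. This is a minimal illustration under stated assumptions, not the paper's code; the checkpoint name, the `prompt_len`/`target_len` defaults, and `train_texts` are placeholders (the newspaper corpus and the domain-specific model are not public).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name standing in for the (private) domain-specific GPT-2.
model_name = "my-domain-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def is_memorized(text: str, prompt_len: int = 50, target_len: int = 50) -> bool:
    """Verbatim-memorization check: prompt with the first `prompt_len` tokens
    of a training document and test whether greedy decoding reproduces the
    next `target_len` tokens exactly."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if len(ids) < prompt_len + target_len:
        return False  # document too short to run the test
    prompt = ids[:prompt_len].unsqueeze(0)
    reference = ids[prompt_len:prompt_len + target_len]
    with torch.no_grad():
        out = model.generate(
            prompt,
            max_new_tokens=target_len,
            do_sample=False,  # greedy decoding, the usual choice for this test
        )
    continuation = out[0, prompt_len:prompt_len + target_len]
    # Guard against early EOS producing a shorter continuation.
    return continuation.shape == reference.shape and bool(
        (continuation == reference).all()
    )

train_texts = ["..."]  # placeholder for the newspaper training articles
rate = sum(is_memorized(t) for t in train_texts) / len(train_texts)
print(f"verbatim memorization rate: {rate:.2%}")
```

Varying `prompt_len` in this sketch is one simple way to probe the prompt-length effect the TL;DR mentions: longer prompts typically make verbatim recovery more likely.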