Many sensitive domains, such as the clinical domain, lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models on synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. The effectiveness of the process is instead almost entirely contingent on the performance of the machine-annotating NER models trained on the original data.

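This study addresses the lack of widely available datasets in the clinical domain caused by privacy risks. By adapting large language models (LLMs) to the clinical domain, we generate synthetic clinical texts tagged for personally identifiable information and use them to train synthetic named entity recognition (NER) models. The results show that NER models trained on synthetic corpora suffer only a small drop in predictive performance, and that the effectiveness of this process depends almost entirely on the performance of the machine-annotating NER models trained on the original data.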

Data-Constrained Synthesis of Training Data for De-Identification