Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks. The cost of training such models (and the necessity of data access to do so), coupled with their utility, motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT. While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar). Would it be safe to release the weights of such a model if they did? In this work, we design a battery of approaches intended to recover Personal Health Information (PHI) from a trained BERT. Specifically, we attempt to recover patient names and the conditions with which they are associated. We find that simple probing methods are not able to meaningfully extract sensitive information from BERT trained over the MIMIC-III corpus of EHR. However, more sophisticated "attacks" may succeed in doing so. To facilitate such research, we make our experimental setup and baseline probing models available at https://github.com/elehman16/exposing_patient_data_release.

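This paper describes a series of methods intended to recover Personal Health Information (PHI) from a trained BERT model, and releases the experimental setup and baseline probing models to facilitate similar research. The results show that simple probing methods cannot effectively extract sensitive information from a BERT trained on the MIMIC-III EHR corpus, but that more sophisticated "attacks" may succeed. Whether releasing a BERT model trained on such EHR data raises data privacy concerns therefore requires more in-depth study.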

Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?