Social chatbots, also known as chit-chat chatbots, have evolved rapidly with large
pretrained language models. Despite the huge progress, privacy concerns have
arisen recently: training data of large language models can be extracted via
model inversion attacks. Moreover, the datasets used for training
chatbots contain many private conversations between two individuals. In this
work, we further investigate the privacy leakage of the hidden states of
chatbots trained by language modeling, which has not yet been well studied. We
show that speakers' personas can be inferred through a simple neural network
with high accuracy. To address this, we propose effective defense objectives to
prevent persona leakage from hidden states. We conduct extensive experiments to
demonstrate that our proposed defense objectives can greatly reduce the attack
accuracy from 37.6% to 0.5%. Meanwhile, the proposed objectives preserve
language models' powerful generation ability.

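This work investigates privacy leakage from the hidden states of social chatbots trained by language modeling and proposes effective defense objectives to protect user privacy. Extensive experiments show that our defense objectives reduce the attack accuracy from 37.6% to 0.5%.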