Protecting personally identifiable information (PII) in text data is crucial for privacy, but current PII generalization methods struggle with uneven data distributions and limited context awareness. To address these issues, we propose two approaches: a feature-based method that uses machine learning to improve performance on structured inputs, and a novel context-aware framework that considers the broader context and the semantic relationships between the original text and its generalized candidates. The context-aware approach uses Multilingual-BERT for text representation, applies functional transformations, and scores candidates by mean squared error. Experiments on the WikiReplace dataset demonstrate the effectiveness of both methods, with the context-aware approach outperforming the feature-based one across different scales. This work advances PII generalization by highlighting the importance of feature selection, ensemble learning, and contextual information for better privacy protection in text anonymization.
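
The candidate-scoring idea in the context-aware approach can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the representations, the transformation, or the scores are combined, so the sketch assumes that the original text's embedding is passed through a transformation and compared with each candidate's embedding by mean squared error, with the lowest error ranked first. The names `toy_embed`, `mse`, and `rank_candidates` are hypothetical, and `toy_embed` is a deterministic stand-in for a Multilingual-BERT sentence vector so that the example runs without model downloads.

```python
import hashlib
import numpy as np

DIM = 16  # toy embedding size (assumption); BERT-base models use 768

def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a Multilingual-BERT sentence vector.

    Each token gets a deterministic pseudo-random vector seeded from its
    SHA-256 hash; the sentence vector is the mean of its token vectors.
    """
    tokens = text.lower().split()
    if not tokens:
        return np.zeros(DIM)
    vecs = []
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "little")
        vecs.append(np.random.default_rng(seed).standard_normal(DIM))
    return np.mean(vecs, axis=0)

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two vectors of equal length."""
    return float(np.mean((a - b) ** 2))

def rank_candidates(text, candidates, transform=lambda v: v):
    """Rank candidates by MSE against the transformed text embedding.

    `transform` plays the role of the functional transformation; the
    identity default is an assumption made for illustration only.
    """
    target = transform(toy_embed(text))
    scored = [(cand, mse(target, toy_embed(cand))) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[1])  # lowest error first

text = "Alice lives in Berlin"
candidates = [
    "a person lives in a city",
    "Alice lives in Berlin",
    "someone lives in Germany",
]
ranked = rank_candidates(text, candidates)
for cand, score in ranked:
    print(f"{score:.4f}  {cand}")
```

With the identity transformation, the unmodified text trivially scores zero error; in a real system the transformation would presumably be fitted so that the preferred generalization level, rather than the original text, scores best.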

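This work proposes two methods for protecting the privacy of personally identifiable information (PII) in text data: a feature-based method that uses machine learning to improve performance on structured inputs, and a novel context-aware framework that considers the context and semantic relationships between the original text and the generalized candidates. Experiments show that the context-aware method outperforms the feature-based method across different scales. By highlighting feature selection, ensemble learning, and the incorporation of contextual information, the work advances PII generalization techniques and improves privacy protection in text anonymization.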

Comparing Feature-Based and Context-Aware Approaches to PII Generalization Level Prediction