Applications of Differential Privacy (DP) in NLP must specify the syntactic level at which a proposed mechanism operates, which often takes the form of $\textit{word-level}$ or $\textit{document-level}$ privatization. Recently, several word-level $\textit{Metric}$ Differential Privacy approaches have been proposed, which rely on this generalized DP notion to operate in word embedding spaces. These approaches, however, often fail to produce semantically coherent textual outputs, and their application at the sentence or document level is only possible by a basic composition of word perturbations. In this work, we strive to address these challenges by operating $\textit{between}$ the word and sentence levels, namely with $\textit{collocations}$. By perturbing n-grams rather than single words, we devise a method where composed privatized outputs have higher semantic coherence and variable length. This is accomplished by constructing an embedding model based on frequently occurring word groups, in which unigram words co-exist with bi- and trigram collocations. We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.
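
To make the idea of perturbing collocations rather than single words more concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes a toy vocabulary, random stand-in embeddings, a greedy longest-match tokenizer, and the noise sampling commonly used in word-level Metric DP mechanisms (a uniformly random direction scaled by a Gamma-distributed radius, giving a density proportional to $\exp(-\varepsilon \lVert z \rVert)$). All names (`VOCAB`, `tokenize`, `perturb`, `privatize`) are hypothetical.

```python
import numpy as np

# Hypothetical toy vocabulary in which unigrams co-exist with bi- and trigram
# collocations (joined by "_"). A real model would be trained on a corpus.
VOCAB = ["i", "love", "visiting", "new", "york", "city", "new_york",
         "new_york_city", "ice", "cream", "ice_cream"]
DIM = 8
# Assumption: random vectors stand in for learned collocation embeddings.
EMB = np.random.default_rng(0).normal(size=(len(VOCAB), DIM))
INDEX = {tok: i for i, tok in enumerate(VOCAB)}


def tokenize(words, max_n=3):
    """Greedy longest-match split into trigram, bigram, or unigram tokens."""
    tokens, i = [], 0
    while i < len(words):
        for n in range(max_n, 0, -1):
            chunk = words[i:i + n]
            cand = "_".join(chunk)
            if len(chunk) == n and cand in INDEX:
                tokens.append(cand)
                i += n
                break
        else:
            # Toy simplification: out-of-vocabulary words pass through as-is.
            tokens.append(words[i])
            i += 1
    return tokens


def sample_noise(epsilon, dim, rng):
    """Draw noise with density proportional to exp(-epsilon * ||z||)."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)          # uniform on the unit sphere
    radius = rng.gamma(shape=dim, scale=1.0 / epsilon)
    return radius * direction


def perturb(token, epsilon, rng):
    """Add noise to the token's embedding and return the nearest vocabulary entry."""
    if token not in INDEX:
        return token  # unchanged in this sketch; offers no protection
    noisy = EMB[INDEX[token]] + sample_noise(epsilon, DIM, rng)
    nearest = int(np.argmin(np.linalg.norm(EMB - noisy, axis=1)))
    return VOCAB[nearest]


def privatize(text, epsilon, seed=None):
    """Privatize text token by token; output length may differ from the input."""
    rng = np.random.default_rng(seed)
    tokens = tokenize(text.lower().split())
    out = [perturb(t, epsilon, rng) for t in tokens]
    return " ".join(o.replace("_", " ") for o in out)


if __name__ == "__main__":
    print(tokenize("i love visiting new york city".split()))
    print(privatize("i love visiting new york city", epsilon=5.0, seed=42))
```

Because a trigram such as `new_york_city` can be replaced by a unigram (or the reverse), the number of words in the output can differ from the input, which illustrates the variable-length property described above. With random embeddings the replacements are not semantically meaningful; that would depend on the trained embedding model.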

A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy