In the rapidly evolving domain of artificial intelligence, safeguarding the
intellectual property of Large Language Models (LLMs) is increasingly crucial.
Current watermarking techniques against model extraction attacks, which rely on
signal insertion in model logits or post-processing of generated text, remain
largely heuristic. We propose a novel method for embedding learnable linguistic
watermarks in LLMs, aimed at tracing and preventing model extraction attacks.
Our approach subtly modifies the LLM's output distribution by introducing
controlled noise into token frequency distributions, embedding a statistically
identifiable, controllable watermark. We leverage statistical hypothesis testing
and information theory, particularly focusing on Kullback-Leibler Divergence,
to differentiate between original and modified distributions effectively. Our
watermarking method strikes a delicate balance between robustness and
output quality, maintaining low false positive/negative rates and preserving
the LLM's original performance.
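
As a rough illustration of the idea, and not the learned watermark proposed here, the following Python sketch perturbs a toy token frequency distribution with a fixed sign pattern, measures the resulting Kullback-Leibler divergence, and scores sampled tokens with a log-likelihood ratio. The function names (`embed_watermark`, `kl_divergence`, `sample_counts`, `detect`), the four-token vocabulary, the sign pattern, and the noise scale `epsilon` are all assumptions made for this example.

```python
import math
import random


def embed_watermark(p, signs, epsilon):
    """Scale each token probability by (1 + epsilon * sign) and renormalize.

    `signs` is a hypothetical secret key of +1/-1 values, one per token;
    the abstract's method learns its perturbation instead.
    """
    noisy = [pi * (1.0 + epsilon * s) for pi, s in zip(p, signs)]
    total = sum(noisy)
    return [x / total for x in noisy]


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sample_counts(dist, n, seed):
    """Draw n token ids from `dist` and return per-token counts."""
    rng = random.Random(seed)
    counts = [0] * len(dist)
    for token in rng.choices(range(len(dist)), weights=dist, k=n):
        counts[token] += 1
    return counts


def detect(counts, p_marked, p_orig):
    """Log-likelihood ratio of the counts: watermarked vs. original model.

    Positive values favor the watermarked distribution. For text drawn from
    the watermarked model, the expected score per token is
    KL(p_marked || p_orig).
    """
    return sum(c * math.log(m / o)
               for c, m, o in zip(counts, p_marked, p_orig))


# Toy example: a 4-token vocabulary with an illustrative key and noise scale.
p_orig = [0.4, 0.3, 0.2, 0.1]
p_marked = embed_watermark(p_orig, signs=[1, -1, 1, -1], epsilon=0.1)

print("KL(marked || orig):", kl_divergence(p_marked, p_orig))
print("score on watermarked samples:",
      detect(sample_counts(p_marked, 20000, seed=0), p_marked, p_orig))
print("score on original samples:",
      detect(sample_counts(p_orig, 20000, seed=0), p_marked, p_orig))
```

In this sketch a small `epsilon` keeps the divergence, and hence the change in output quality, low, but more tokens are then needed before the score separates the two hypotheses. This is a simplified view of the trade-off between robustness and output quality described above.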
