Simple fine-tuning can embed hidden text into large language models (LLMs).
This text is revealed only when the LLM receives a specific trigger query. Two
primary applications are LLM fingerprinting
and steganography. In the context of LLM fingerprinting, a unique text
identifier (fingerprint) is embedded within the model to verify licensing
compliance. In the context of steganography, the LLM serves as a carrier for
hidden messages that can be disclosed through a designated trigger.
Our work demonstrates that embedding hidden text in the LLM via fine-tuning,
though seemingly secure due to the vast number of potential triggers (any
sequence of characters or tokens could serve as a trigger), is susceptible to
extraction through analysis of the LLM's output decoding process. We propose a
novel approach to extraction called Unconditional Token Forcing. It rests on
the hypothesis that feeding each token of the LLM's vocabulary into the model,
one at a time, should reveal output sequences with abnormally high token
probabilities, which mark candidate embedded texts. Additionally, our
experiments show that when the first token of a hidden fingerprint is used as
an input, the LLM not only produces an output sequence with high token
probabilities, but also repeatedly generates the fingerprint itself. We also
present a method, which we call Unconditional Token Forcing Confusion, for
hiding text so that it resists Unconditional Token Forcing.
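
The extraction loop described above can be illustrated with a minimal sketch.
It is not the authors' implementation: `make_toy_model` is a hypothetical
stand-in for a fine-tuned LLM (uniform everywhere except along one hidden
sequence), `unconditional_token_forcing` is an illustrative name, and scoring
each candidate by the mean probability of its greedily decoded tokens is an
assumption made here for simplicity.

```python
from typing import Callable, Dict, List, Tuple

NextProbs = Callable[[List[str]], Dict[str, float]]


def make_toy_model(vocab: List[str], hidden: List[str], p_hi: float = 0.97) -> NextProbs:
    """Hypothetical stand-in for a fine-tuned LLM (not a real model).

    The next-token distribution is uniform unless the input so far is a
    (possibly repeated) prefix of `hidden`; in that case the next hidden
    token receives probability `p_hi`.
    """
    n = len(hidden)

    def next_probs(prefix: List[str]) -> Dict[str, float]:
        on_hidden_path = all(tok == hidden[i % n] for i, tok in enumerate(prefix))
        if not on_hidden_path:
            return {tok: 1.0 / len(vocab) for tok in vocab}
        target = hidden[len(prefix) % n]
        rest = (1.0 - p_hi) / (len(vocab) - 1)
        dist = {tok: rest for tok in vocab}
        dist[target] = p_hi
        return dist

    return next_probs


def unconditional_token_forcing(
    vocab: List[str], next_probs: NextProbs, steps: int = 7
) -> List[Tuple[float, List[str]]]:
    """Feed each vocabulary token alone as input, decode greedily, and rank
    the resulting sequences by mean token probability (highest first)."""
    results = []
    for first in vocab:
        seq, probs = [first], []
        for _ in range(steps):
            dist = next_probs(seq)
            nxt = max(dist, key=dist.get)  # greedy decoding
            probs.append(dist[nxt])
            seq.append(nxt)
        results.append((sum(probs) / len(probs), seq))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


vocab = ["the", "cat", "sat", "on", "mat", "secret", "code", "is", "blue", "42"]
hidden = ["secret", "code", "is", "42"]  # toy fingerprint
ranked = unconditional_token_forcing(vocab, make_toy_model(vocab, hidden))
best_score, best_seq = ranked[0]
print(round(best_score, 3), " ".join(best_seq))
```

In this toy setting only the candidate that starts with the first fingerprint
token scores far above the uniform baseline, and its decoded sequence repeats
the fingerprint. With a real LLM the scan would run over the model's full
tokenizer vocabulary and the scores would be much noisier.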

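Using simple fine-tuning, hidden text can be embedded in a large language model
and is revealed only when a specific query triggers it. This work shows that
embedding hidden text in a language model via fine-tuning, although seemingly
secure because of the vast number of potential triggers (any sequence of
characters or tokens can serve as a trigger), remains vulnerable: the hidden
text can be extracted by analyzing the language model's output decoding process.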