Large Language Models (LLMs) are now widely used in various applications,
making it crucial to align their ethical standards with human values. However,
recent jail-breaking methods demonstrate that this alignment can be undermined
using carefully constructed prompts. In our study, we reveal a new threat to
LLM alignment when a bad actor has access to the model's output logits, a
common feature in both open-source LLMs and many commercial LLM APIs (e.g.,
certain GPT models). Our attack does not rely on crafting specific prompts. Instead, it
exploits the fact that even when an LLM rejects a toxic request, a harmful
response often hides deep in the output logits. By forcefully selecting
lower-ranked output tokens during the auto-regressive generation process at a
few critical output positions, we can compel the model to reveal these hidden
responses. We term this process model interrogation. This approach differs from
and outperforms jail-breaking methods, achieving 92% effectiveness compared to
62%, and is 10 to 20 times faster. The harmful content uncovered through our
method is more relevant, complete, and clear. Additionally, it can complement
jail-breaking strategies, and combining the two further boosts attack
performance. Our findings indicate that interrogation can extract toxic
knowledge even from models specifically designed for coding tasks.

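The alignment of large language models' ethical standards with human values can
be undermined by abusing the model's output logits. Our proposed model
interrogation method reveals harmful responses hidden in the output logits,
achieving 92% effectiveness and running 10 to 20 times faster, and it also
works on models built for coding tasks.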