Large language models (LLMs) are prone to off-topic misuse, in which users prompt a model to perform tasks beyond its intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false-positive rates and limited adaptability, and they require real-world data that is not available in pre-production. In this paper, we introduce a flexible, data-free guardrail development methodology that addresses these challenges. By thoroughly defining the problem space qualitatively and passing this definition to an LLM to generate diverse prompts, we construct a synthetic dataset to benchmark and train off-topic guardrails that outperform heuristic approaches. Additionally, by framing the task as classifying whether the user prompt is relevant with respect to the system prompt, our guardrails generalize effectively to other misuse categories, including jailbreak and harmful prompts. Finally, we open-source both the synthetic dataset and the off-topic guardrail models, providing resources for developing guardrails in pre-production environments and supporting future research and development in LLM safety.
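
To make the task framing concrete, the following is a minimal sketch of the classification interface: each example pairs a system prompt with a user prompt, and the guardrail returns a binary off-topic decision. The scoring function shown here is a simple token-overlap heuristic of the kind a trained guardrail would be compared against; it is not the method proposed in this paper. The names `GuardrailExample`, `overlap_score`, and `is_off_topic`, as well as the 0.5 threshold, are assumptions introduced for illustration only.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailExample:
    """One (system prompt, user prompt) pair with a binary label (hypothetical schema)."""
    system_prompt: str
    user_prompt: str
    off_topic: bool


def _tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(system_prompt: str, user_prompt: str) -> float:
    """Fraction of distinct user-prompt tokens that also occur in the system prompt."""
    user = _tokens(user_prompt)
    if not user:
        return 0.0
    return len(user & _tokens(system_prompt)) / len(user)


def is_off_topic(system_prompt: str, user_prompt: str, threshold: float = 0.5) -> bool:
    """Heuristic baseline: flag the prompt when lexical overlap falls below the threshold.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return overlap_score(system_prompt, user_prompt) < threshold


SYSTEM = (
    "You are a banking assistant. Answer questions about bank accounts, "
    "loans, and credit cards."
)

examples = [
    GuardrailExample(SYSTEM, "What credit cards and loans do you offer?", off_topic=False),
    GuardrailExample(SYSTEM, "Write a poem about the ocean", off_topic=True),
]

# Compare the heuristic's prediction with the label for each example.
predictions = [is_off_topic(ex.system_prompt, ex.user_prompt) for ex in examples]
```

A heuristic like this depends entirely on surface vocabulary, so a relevant request phrased in words absent from the system prompt would be flagged incorrectly. A classifier trained on synthetic (system prompt, user prompt) pairs is intended to avoid that brittleness.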

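This work addresses the challenges that large language models face from off-topic use, where existing guardrails suffer from high false-positive rates and limited adaptability. By defining the problem space and generating diverse prompts, a synthetic dataset is constructed to improve guardrail effectiveness, and the results show that the new method outperforms traditional heuristic approaches. The study also open-sources the synthetic dataset and the guardrail models, supporting guardrail development in pre-production environments and future research.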

A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection