With the ubiquity of Large Language Models (LLMs), guardrails have become crucial for detecting and defending against toxic content. However, as LLMs are increasingly deployed in multilingual scenarios, the effectiveness of guardrails in handling multilingual toxic inputs remains unclear. In this work, we introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails. We also investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrail performance. Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts. This work aims to identify the limitations of guardrails and to support building more reliable and trustworthy LLMs in multilingual scenarios.

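This study addresses the question of how effective existing guardrails for large language models (LLMs) are at handling toxic content in multilingual settings. By introducing a comprehensive multilingual test suite covering seven datasets and more than ten languages, it evaluates the performance of state-of-the-art guardrails and their resilience against recent jailbreaking techniques. The study finds that existing guardrails remain ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts, and it aims to identify their limitations in order to build more reliable multilingual LLMs.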

Benchmarking Large Language Model Guardrails in Handling Multilingual Toxicity