Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM responds to with a toxic output or an input that induces the hallucinative response from the target LLM containing politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potential harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.

本研究针对对大型语言模型（LLMs）的审计问题进行探讨，特别是在无法访问其参数的情况下。提出了基于好奇心驱动的审计方法CALM，通过强化学习来优化审计过程，以识别有害和偏见的输入输出对，显著提高了识别不当行为的能力，为黑盒LLMs的审计提供了新思路。

CALM: 基于好奇心驱动的大型语言模型审计