Red-teaming strategies for Large Language Models (LLMs) are proliferating, yet
the literature on improving the safety and robustness of LLM defense
strategies remains thin, and the gap is becoming increasingly pronounced. This
paper introduces the LLM-based \textbf{sentinel} model as a plug-and-play
prefix module designed to reconstruct the input prompt with just a few ($<30$)
additional tokens, effectively reducing toxicity in responses from target LLMs.
The sentinel model naturally sidesteps the \textit{parameter inefficiency} and
\textit{limited model accessibility} that come with fine-tuning large target models. We
employ an interleaved training regimen using Proximal Policy Optimization (PPO)
to optimize both red team and sentinel models dynamically, incorporating a
value head-sharing mechanism inspired by the multi-agent centralized critic to
manage the complex interplay between agents. Our extensive experiments across
text-to-text and text-to-image tasks demonstrate the effectiveness of our
approach in mitigating toxic outputs, even with larger target models such as
\texttt{Llama-2}, \texttt{GPT-3.5}, and \texttt{Stable-Diffusion}, highlighting
the potential of our framework to enhance safety and robustness in various
applications.
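
The plug-and-play idea can be illustrated with a minimal sketch. It is not the paper's implementation: `guard_prompt`, `toy_sentinel`, and `toy_target` are hypothetical names, the sentinel is a fixed stub rather than a PPO-trained model, and whitespace splitting stands in for a real tokenizer. The only detail taken from the abstract is the budget of fewer than 30 added tokens.

```python
from typing import Callable

MAX_PREFIX_TOKENS = 30  # the abstract states fewer than 30 additional tokens


def guard_prompt(prompt: str,
                 sentinel: Callable[[str], str],
                 max_prefix_tokens: int = MAX_PREFIX_TOKENS) -> str:
    """Prepend a short sentinel-generated prefix to the user prompt.

    Assumption: the sentinel maps a prompt to a short text prefix, and
    whitespace tokens approximate the target model's tokens.
    """
    prefix_tokens = sentinel(prompt).split()
    # Keep strictly fewer than max_prefix_tokens tokens of prefix.
    prefix_tokens = prefix_tokens[: max_prefix_tokens - 1]
    if not prefix_tokens:
        return prompt
    return " ".join(prefix_tokens + [prompt])


def toy_sentinel(prompt: str) -> str:
    """Hypothetical stand-in for a trained sentinel model."""
    return "Please answer safely and respectfully."


def toy_target(prompt: str) -> str:
    """Hypothetical stand-in for a frozen target model behind an API."""
    return f"[response to: {prompt}]"


# The target model is only called, never fine-tuned or inspected.
guarded = guard_prompt("Tell me about chemistry.", toy_sentinel)
response = toy_target(guarded)
```

Because the wrapper only touches the input string, the same pattern would apply to a target model reachable solely through an API, which is the accessibility setting the abstract describes.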

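By introducing an LLM-based sentinel model, the paper proposes a plug-and-play prefix module that effectively reduces toxic content in the outputs of target LLMs by adding a small number (<30) of tokens, overcoming the limitations of parameter efficiency and model accessibility. We adopt an interleaved training regimen that uses Proximal Policy Optimization (PPO) to dynamically optimize both the red team and sentinel models, combined with a value head-sharing mechanism inspired by the multi-agent centralized critic to manage the complex interplay between agents. Our extensive text-to-text and text-to-image experiments demonstrate the effectiveness of the approach, which reduces toxic outputs even with large models such as Llama-2, GPT-3.5, and Stable-Diffusion, highlighting the framework's potential to improve safety and robustness across applications.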

Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming

Model-based reinforcement learning seeks to simultaneously learn the dynamics
of an unknown stochastic environment and synthesise an optimal policy for
acting in it. Ensuring the safety and robustness of sequential decisions made
through a policy in such an environment is a key challenge for policies
intended for safety-critical scenarios. In this work, we investigate two
complementary problems: first, computing reach-avoid probabilities for
iterative predictions made with dynamical models whose dynamics are described
by a Bayesian neural network (BNN); second, synthesising control policies that are
optimal with respect to a given reach-avoid specification (reaching a "target"
state, while avoiding a set of "unsafe" states) and a learned BNN model. Our
solution leverages interval propagation and backward recursion techniques to
compute lower bounds for the probability that a policy's sequence of actions
leads to satisfying the reach-avoid specification. Such computed lower bounds
provide safety certification for the given policy and BNN model. We then
introduce control synthesis algorithms to derive policies maximizing said lower
bounds on the safety probability. We demonstrate the effectiveness of our
method on a series of control benchmarks characterized by learned BNN dynamics
models. On our most challenging benchmark, compared to purely data-driven
policies, the optimal synthesis algorithm provides more than a four-fold
increase in the number of certifiable states and more than a three-fold
increase in the average guaranteed reach-avoid probability.
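
The backward recursion described above can be sketched on a toy problem. This is an illustration under strong assumptions, not the paper's algorithm: the state space is taken to be already discretized into finitely many regions, and the array `p_lower` of lower bounds on transition probabilities is simply given here, whereas in the paper such bounds would come from interval propagation through the BNN. The function name and the three-state example are hypothetical.

```python
from typing import List, Sequence, Tuple


def reach_avoid_lower_bound(
    p_lower: Sequence[Sequence[Sequence[float]]],
    target: Sequence[bool],
    unsafe: Sequence[bool],
    horizon: int,
) -> Tuple[List[float], List[List[int]]]:
    """Backward recursion for a lower bound on the reach-avoid probability.

    p_lower[s][a][t] is an assumed lower bound on the probability of moving
    from region s to region t under action a. Returns the bound per region
    and a time-varying policy that maximizes it.
    """
    n_states = len(p_lower)
    # Terminal condition: only regions already in the target count as success.
    value = [1.0 if target[s] else 0.0 for s in range(n_states)]
    policy = [[0] * n_states for _ in range(horizon)]
    for k in reversed(range(horizon)):
        new_value = [0.0] * n_states
        for s in range(n_states):
            if target[s]:
                new_value[s] = 1.0  # specification already satisfied
                continue
            if unsafe[s]:
                continue  # specification violated, bound stays 0
            # Lower-bounding each transition keeps the sum a valid lower bound
            # because all values are non-negative.
            q = [sum(p * v for p, v in zip(row, value)) for row in p_lower[s]]
            best = max(range(len(q)), key=q.__getitem__)
            policy[k][s] = best
            new_value[s] = q[best]
        value = new_value
    return value, policy


# Toy example: region 0 is the start, region 1 the target, region 2 unsafe.
p_lower = [
    [[0.5, 0.3, 0.1], [0.1, 0.6, 0.2]],
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
target = [False, True, False]
unsafe = [False, False, True]
bound, policy = reach_avoid_lower_bound(p_lower, target, unsafe, horizon=2)
```

Fixing the action at each step instead of maximizing gives the certification variant for a given policy; taking the maximum, as here, corresponds to the synthesis step that picks actions to maximize the certified bound.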

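This study examines safety and robustness in model-based reinforcement learning. It covers computing reach-avoid probabilities for iterative predictions made with dynamics models described by Bayesian neural networks, and using control synthesis algorithms to derive optimal control policies that satisfy the safety constraints under the learned dynamics model.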