Enhancing Reinforcement Learning Safety through Counterfactual Large Language Model Reasoning

September 2024

Enhancing RL Safety with Counterfactual LLM Reasoning

Dennis Gross, Helge Spieker

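TL;DR: This work addresses the insufficient safety and poor explainability of reinforcement learning (RL) policies. By introducing counterfactual large language model reasoning, the study shows that the approach significantly improves the safety of RL policies after training and helps provide better explanations. The work offers a new approach to safety assurance in reinforcement learning.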

Abstract

Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance RL policy safety post-training. We show that our approach