Ensuring the safety of AI systems has recently emerged as a critical priority for real-world deployment, particularly in physical AI applications. Current approaches to AI safety typically address predefined, domain-specific safety conditions, which limits their ability to generalize across contexts. We propose a novel AI safety framework that ensures AI systems comply with any user-defined constraint, with any desired probability, and across various domains. In this framework, we combine an AI component (e.g., a neural network) with an optimization problem to produce responses that minimize objectives while satisfying user-defined constraints with probabilities exceeding user-defined thresholds. To assess the credibility of the AI component, we propose internal test data, a supplementary set of safety-labeled data, together with a conservative testing methodology that provides statistical validity for using internal test data. We also present a method for approximating the loss function and show how to compute its gradient for training. We mathematically prove that probabilistic constraint satisfaction is guaranteed under specific, mild conditions, and we prove a scaling law between safety and the number of internal test data. We demonstrate our framework's effectiveness through experiments in diverse domains: demand prediction for production decisions, safe reinforcement learning within the SafetyGym simulator, and guarding AI chatbot outputs. Through these experiments, we demonstrate that our method guarantees safety for user-specified constraints, outperforms existing methods by up to several orders of magnitude in low safety threshold regions, and scales effectively with respect to the size of internal test data.
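
As a rough illustration of what a conservative test on safety-labeled internal test data could look like, the sketch below bounds the true constraint-violation probability from above with a one-sided Hoeffding inequality and accepts only if that bound stays under the user-defined threshold. The choice of Hoeffding's inequality, the function names, and the `delta` parameter are assumptions made for illustration; this is not the testing methodology of the framework itself.

```python
import math


def violation_upper_bound(violations: int, n: int, delta: float) -> float:
    """Upper bound on the true violation probability (hypothetical helper).

    Assumes n i.i.d. safety-labeled samples. By the one-sided Hoeffding
    inequality, the returned value exceeds the true violation probability
    with probability at least 1 - delta.
    """
    if n <= 0 or not 0 <= violations <= n:
        raise ValueError("need n > 0 and 0 <= violations <= n")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    p_hat = violations / n  # empirical violation rate
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return min(1.0, p_hat + slack)


def passes_conservative_test(
    violations: int, n: int, threshold: float, delta: float = 0.05
) -> bool:
    """Accept only if the upper bound is within the allowed violation rate."""
    return violation_upper_bound(violations, n, delta) <= threshold


if __name__ == "__main__":
    # The slack term shrinks as the number of test samples grows.
    for n in (100, 1_000, 10_000):
        print(n, round(violation_upper_bound(0, n, 0.05), 4))
```

Under these assumptions, the slack term shrinks proportionally to the inverse square root of the number of samples, so a stricter threshold can only be certified with more internal test data.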

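This work addresses the problem that existing AI safety methods are typically limited to domain-specific safety conditions, and proposes a new AI safety framework that ensures AI systems satisfy user-defined constraints with any desired probability. Experiments show that the framework is effective across multiple domains, significantly outperforms existing methods in low safety threshold regions, and scales effectively with the size of the internal test data.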

A Domain-Agnostic, Scalable Framework for AI Safety Assurance