Shielding is a popular technique for achieving safe reinforcement learning
(RL). However, classical shielding approaches come with restrictive
assumptions that make them difficult to deploy in complex environments,
particularly those with continuous state or action spaces. In this paper, we
extend the more versatile approximate model-based shielding (AMBS) framework to
the continuous setting. In particular, we use Safety Gym as our test-bed,
allowing for a more direct comparison of AMBS with popular constrained RL
algorithms. We also provide strong probabilistic safety guarantees for the
continuous setting. In addition, we propose two novel penalty techniques that
directly modify the policy gradient, which empirically provide more stable
convergence in our experiments.
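
The general idea of a model-based shield can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a shield that samples rollouts of the task policy through an approximate model, estimates the probability of reaching an unsafe state within a fixed horizon, and falls back to a backup policy when that estimate exceeds a threshold. All names here (`violation_probability`, `shielded_action`, `toy_model_step`, the 1-D dynamics, and the parameter values) are hypothetical and chosen only for illustration.

```python
import random


def violation_probability(state, policy, model_step, is_unsafe,
                          horizon, num_rollouts, rng):
    """Monte Carlo estimate of the chance of hitting an unsafe state
    within `horizon` steps when following `policy` in the model."""
    violations = 0
    for _ in range(num_rollouts):
        s = state
        for _ in range(horizon):
            s = model_step(s, policy(s), rng)
            if is_unsafe(s):
                violations += 1
                break
    return violations / num_rollouts


def shielded_action(state, task_policy, safe_policy, model_step, is_unsafe,
                    horizon=3, num_rollouts=100, threshold=0.1, rng=None):
    """Return the task policy's action unless the estimated violation
    probability exceeds `threshold`; then return the backup action."""
    rng = rng or random.Random(0)
    p = violation_probability(state, task_policy, model_step, is_unsafe,
                              horizon, num_rollouts, rng)
    return safe_policy(state) if p > threshold else task_policy(state)


# Toy 1-D setting (an assumption for illustration): the state is a position,
# positions above 1.0 are unsafe, and the model adds small Gaussian noise.
def toy_model_step(s, a, rng):
    return s + a + rng.gauss(0.0, 0.01)


def toy_is_unsafe(s):
    return s > 1.0


def toy_task_policy(s):
    return 0.2   # always move toward the unsafe region


def toy_safe_policy(s):
    return -0.2  # always move away from it


far = shielded_action(0.0, toy_task_policy, toy_safe_policy,
                      toy_model_step, toy_is_unsafe)
near = shielded_action(0.9, toy_task_policy, toy_safe_policy,
                       toy_model_step, toy_is_unsafe)
print(far, near)
```

In this toy setting the shield leaves the task policy alone far from the boundary and overrides it close to the boundary. The horizon, the number of rollouts, and the threshold trade off computation against how conservative the shield is.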

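This paper presents a method for safe reinforcement learning in continuous environments. It extends the approximate model-based shielding (AMBS) framework to the continuous setting and proposes two novel penalty techniques that improve the stability of policy-gradient convergence.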

Leveraging Approximate Model-based Shielding for Probabilistic Safety Guarantees in Continuous Environments

The problem of learning logical rules from examples arises in diverse fields,
including program synthesis, logic programming, and machine learning. Existing
approaches either involve solving computationally difficult combinatorial
problems, or performing parameter estimation in complex statistical models.
In this paper, we present Difflog, a technique to extend the logic
programming language Datalog to the continuous setting. By attaching
real-valued weights to individual rules of a Datalog program, we naturally
associate numerical values with individual conclusions of the program.
Analogous to the strategy of numerical relaxation in optimization problems, we
can now first determine the rule weights which cause the best agreement between
the training labels and the induced values of output tuples, and subsequently
recover the classical discrete-valued target program from the continuous
optimum.
We evaluate Difflog on a suite of 34 benchmark problems from recent
literature in knowledge discovery, formal verification, and database
query-by-example, and demonstrate significant improvements in learning complex
programs with recursive rules, invented predicates, and relations of arbitrary
arity.
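
The relax-then-discretize strategy described above can be sketched on a toy problem. The abstract does not say how rule weights combine into tuple values, so this sketch assumes one common choice: a derivation's value is the product of the weights of the rules it uses, and a tuple's value is the maximum over its derivations. The candidate rules, the finite-difference optimizer, the 0.5 threshold, and every function name are hypothetical, not the Difflog implementation.

```python
from itertools import product

# Candidate rules (hypothetical), each carrying a weight in [0, 1]:
#   r1: path(x, y) :- edge(x, y)
#   r2: path(x, z) :- path(x, y), edge(y, z)
#   r3: path(x, y) :- edge(y, x)        (a spurious candidate)
RULES = ["r1", "r2", "r3"]


def evaluate(weights, edges, nodes, max_iters=10):
    """Iterate to a fixpoint, giving each path tuple the best
    (max over derivations of product of rule weights) value."""
    w1, w2, w3 = weights
    edge = {e: 1.0 for e in edges}
    path = {}
    for _ in range(max_iters):
        new = {}
        for x, y in product(nodes, repeat=2):
            candidates = [0.0,
                          w1 * edge.get((x, y), 0.0),
                          w3 * edge.get((y, x), 0.0)]
            candidates += [w2 * path.get((x, m), 0.0) * edge.get((m, y), 0.0)
                           for m in nodes]
            new[(x, y)] = max(candidates)
        if new == path:
            break
        path = new
    return path


def loss(weights, edges, nodes, labels):
    """Squared error between tuple values and 0/1 training labels."""
    values = evaluate(weights, edges, nodes)
    return sum((v - labels.get(t, 0.0)) ** 2 for t, v in values.items())


def fit(edges, nodes, labels, steps=200, lr=0.1, eps=1e-4):
    """Projected gradient descent using central finite differences."""
    weights = [0.5, 0.5, 0.5]
    for _ in range(steps):
        grads = []
        for i in range(len(weights)):
            hi = list(weights)
            lo = list(weights)
            hi[i] += eps
            lo[i] -= eps
            grads.append((loss(hi, edges, nodes, labels)
                          - loss(lo, edges, nodes, labels)) / (2 * eps))
        # Clip back into [0, 1] after each step.
        weights = [min(1.0, max(0.0, w - lr * g))
                   for w, g in zip(weights, grads)]
    return weights


def discretize(weights, threshold=0.5):
    """Recover a discrete program by keeping the high-weight rules."""
    return [name for name, w in zip(RULES, weights) if w > threshold]


edges = [(0, 1), (1, 2)]
nodes = [0, 1, 2]
labels = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0}  # desired path tuples
learned = fit(edges, nodes, labels)
print(discretize(learned))
```

On this example the optimizer drives the weights of the two transitive-closure rules up and the weight of the spurious rule down, so thresholding keeps only the former. A real system would need a scalable evaluation strategy and a more careful optimizer than finite differences.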

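This paper presents Difflog, a technique that extends the learning of logical rules from the discrete to the continuous setting. By attaching real-valued weights to the individual rules of a Datalog program, it naturally associates numerical values with the program's individual conclusions, and it achieves significant improvements in learning complex programs on problems in knowledge discovery, formal verification, and database query-by-example.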