This paper studies safe reinforcement learning (RL) in a
partially observable environment, with the aim of achieving safe-reachability
objectives. In traditional partially observable Markov decision processes
(POMDPs), ensuring safety typically involves estimating the belief over latent
states. However, accurately estimating an optimal Bayesian filter in a POMDP to
infer latent states from observations in a continuous state space poses a
significant challenge, largely due to the intractable likelihood. To tackle
this issue, we propose a stochastic model-based approach that guarantees RL
safety almost surely under unknown system dynamics and partial
observability. We leverage the Predictive State Representation
(PSR) and Reproducing Kernel Hilbert Space (RKHS) to represent future
multi-step observations analytically, and the results in this context are
provable. Furthermore, we derive essential operators from the kernel Bayes'
rule, enabling the recursive estimation of future observations using various
operators. Under the assumption of \textit{undercompleteness}, a polynomial
sample complexity is established for the RL algorithm with observation and
action spaces of infinite size, ensuring an $\epsilon$-suboptimal safe policy
guarantee.

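This paper studies safe reinforcement learning in partially observable environments, aiming to achieve safe-reachability objectives. It proposes a stochastic model-based approach that guarantees RL safety almost surely under unknown system dynamics and partial observability. Using predictive state representations and reproducing kernel Hilbert spaces, it represents future multi-step observations analytically, and it derives the key operators from the kernel Bayes' rule so that future observations can be estimated recursively. For observation and action spaces of infinite size, it establishes a polynomial sample complexity for the RL algorithm, ensuring an ε-suboptimal safe policy guarantee.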

Safe Reinforcement Learning in Tensor Reproducing Kernel Hilbert Space
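
The abstract above does not spell out its operators, so the following is only a minimal sketch of the standard conditional mean embedding estimate that kernel Bayes' rule builds on. It is not the paper's algorithm. It assumes scalar histories and futures, a Gaussian kernel, and ridge regularization; all function names and parameters (`rbf`, `solve`, `conditional_weights`, `predict`, `reg`, `bandwidth`) are hypothetical. Given paired samples of a history feature and a future observation, the expectation of a function `g` of the future given a query history is approximated as a weighted sum of `g` over the observed futures, with weights `(K + n*reg*I)^{-1} k(query)`.

```python
import math

def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars (an assumed kernel choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def conditional_weights(history, query, reg=1e-3, bandwidth=1.0):
    """Weights (K + n*reg*I)^{-1} k(query) of the embedding estimate."""
    n = len(history)
    gram = [[rbf(a, b, bandwidth) + (n * reg if i == j else 0.0)
             for j, b in enumerate(history)]
            for i, a in enumerate(history)]
    k_query = [rbf(a, query, bandwidth) for a in history]
    return solve(gram, k_query)

def predict(history, futures, query, g=lambda y: y, **kwargs):
    """Estimate E[g(future) | history = query] as a weighted sum over samples."""
    weights = conditional_weights(history, query, **kwargs)
    return sum(w * g(y) for w, y in zip(weights, futures))

# Toy data (made up for illustration): the future is twice the history.
HISTORY = [i / 10 for i in range(-20, 21)]
FUTURES = [2.0 * h for h in HISTORY]
estimate = predict(HISTORY, FUTURES, 0.5)  # should land close to 1.0
```

In a recursive filter, the same weights would be recomputed after each new observation and action. This sketch shows only the single conditioning step.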

We present a computational framework for synthesis of distributed control
strategies for a heterogeneous team of robots in a partially observable
environment. The goal is to cooperatively satisfy specifications given as
Truncated Linear Temporal Logic (TLTL) formulas. Our approach formulates the
synthesis problem as a stochastic game and employs a policy graph method to
find a control strategy with memory for each agent. We construct the stochastic
game on the product between the team transition system and a finite state
automaton (FSA) that tracks the satisfaction of the TLTL formula. We use the
quantitative semantics of TLTL as the reward of the game, and further reshape
it using the FSA to guide and accelerate the learning process. Simulation
results demonstrate the efficacy of the proposed solution under demanding task
specifications and the effectiveness of reward shaping in significantly
accelerating learning.

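This paper proposes a computational framework for synthesizing distributed control strategies for a heterogeneous team of robots under partial observability, with the goal of satisfying Truncated Linear Temporal Logic (TLTL) specifications. The approach formulates the synthesis problem as a stochastic game and uses a policy graph method to find a control strategy with memory for each robot. Simulation results demonstrate the effectiveness of the solution and of reward shaping.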

Distributed Control using Reinforcement Learning with Temporal-Logic-Based Reward Shaping
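
The abstract above says the reward is reshaped using the FSA but does not give the rule. One common way to do this, shown here purely as an assumed stand-in, is potential-based shaping with the potential set to the negative graph distance from the current automaton state to an accepting state. The automaton, its labels, and the names `distance_to_accept` and `shaped_reward` are hypothetical.

```python
from collections import deque

def distance_to_accept(transitions, accepting):
    """BFS over reversed FSA edges: fewest transitions from each state to acceptance.

    States that cannot reach an accepting state are left out of the result.
    """
    reverse = {}
    for src, edges in transitions.items():
        for dst in edges.values():
            reverse.setdefault(dst, set()).add(src)
    dist = {q: 0 for q in accepting}
    queue = deque(accepting)
    while queue:
        q = queue.popleft()
        for p in reverse.get(q, ()):
            if p not in dist:
                dist[p] = dist[q] + 1
                queue.append(p)
    return dist

def shaped_reward(reward, q, q_next, dist, gamma=0.99):
    """Potential-based shaping with potential phi(q) = -dist[q]."""
    return reward + gamma * (-dist[q_next]) - (-dist[q])

# Hypothetical FSA for "eventually reach a, then eventually reach b".
FSA = {
    "q0": {"a": "q1", "other": "q0"},
    "q1": {"b": "q2", "other": "q1"},
    "q2": {"other": "q2"},
}
DIST = distance_to_accept(FSA, ["q2"])
```

With `gamma=1.0`, a step that advances the automaton from `q0` to `q1` earns a bonus of one, while a self-loop earns none, so progress toward satisfaction is rewarded before the task completes.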

We consider apprenticeship learning, i.e., having an agent learn a task by
observing an expert demonstrating the task in a partially observable
environment when the model of the environment is uncertain. This setting is
useful in applications where the explicit modeling of the environment is
difficult, such as a dialogue system. We show that we can extract information
about the environment model by inferring the action selection process behind the
demonstration, under the assumption that the expert is choosing optimal actions
based on knowledge of the true model of the target environment. The proposed
algorithms can achieve more accurate estimates of POMDP parameters and better
policies from a short demonstration, compared to methods that learn only from
the environment's reactions.

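By inferring the action selection process behind an expert's demonstration, an agent learning a task in a partially observable environment with model uncertainty can estimate POMDP parameters more accurately and obtain better policies from a short demonstration than methods that learn only from the environment's reactions.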
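
The idea of reading model information out of expert actions can be illustrated with a small Bayesian sketch. It is not the algorithm from the abstract above. It assumes a finite set of candidate models, each supplying action values for every decision context, and it relaxes the optimal-expert assumption to a softmax (near-optimal) expert with sharpness `beta`. All names (`action_likelihood`, `model_posterior`, `CANDIDATES`, `DEMO`) and all numbers are hypothetical.

```python
import math

def action_likelihood(q_values, action, beta=5.0):
    """Softmax probability that a near-optimal expert picks `action`."""
    top = max(q_values.values())
    exps = {a: math.exp(beta * (q - top)) for a, q in q_values.items()}
    return exps[action] / sum(exps.values())

def model_posterior(candidates, demo, prior=None, beta=5.0):
    """Posterior over candidate models given (context, action) pairs from the expert."""
    names = list(candidates)
    log_post = {n: math.log(prior[n] if prior else 1.0 / len(names))
                for n in names}
    for context, action in demo:
        for n in names:
            q_values = candidates[n][context]
            log_post[n] += math.log(action_likelihood(q_values, action, beta))
    top = max(log_post.values())  # subtract the max for numerical stability
    unnorm = {n: math.exp(v - top) for n, v in log_post.items()}
    total = sum(unnorm.values())
    return {n: u / total for n, u in unnorm.items()}

# Two made-up candidate models that disagree on which action is best.
CANDIDATES = {
    "model_A": {"ctx": {"ask": 1.0, "act": 0.0}},
    "model_B": {"ctx": {"ask": 0.0, "act": 1.0}},
}
DEMO = [("ctx", "ask")] * 3  # the expert asks three times
POSTERIOR = model_posterior(CANDIDATES, DEMO)
```

Because the expert keeps choosing the action that only `model_A` rates highly, the posterior concentrates on `model_A` even though no environment reaction was observed.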