Reinforcement Learning (RL) has emerged as an efficient method for
solving complex sequential decision-making problems in automatic control,
computer science, economics, and biology. In this paper we present a model-free
RL algorithm to synthesize control policies that maximize the probability of
satisfying high-level control objectives given as Linear Temporal Logic (LTL)
formulas. Uncertainty is considered in the workspace properties, the structure
of the workspace, and the agent actions, giving rise to a
Probabilistically-Labeled Markov Decision Process (PL-MDP) with unknown graph
structure and stochastic behaviour, which is an even more general case than a
fully unknown MDP. We first translate the LTL specification into a Limit
Deterministic Büchi Automaton (LDBA), which is then used in an on-the-fly
product with the PL-MDP. Thereafter, we define a synchronous reward function
based on the acceptance condition of the LDBA. Finally, we show that the RL
algorithm delivers a policy that maximizes the satisfaction probability
asymptotically. We provide experimental results that showcase the efficiency of
the proposed method.
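
The pipeline described above (automaton, on-the-fly product, acceptance-based reward, model-free learning) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the paper's algorithm: the four-state chain, the slip and labelling probabilities, the hand-written two-state automaton for "eventually goal" (standing in for an LDBA translated from an LTL formula), and the reward of 1 for every step that lands in the accepting automaton state.

```python
import random

N_STATES = 4                 # hypothetical chain 0..3; state 3 is the candidate goal cell
ACTIONS = ("left", "right")
SLIP = 0.1                   # assumed probability that a move goes the other way
P_GOAL_LABEL = 0.9           # assumed probability that state 3 emits the label "goal"

def mdp_step(s, a, rng):
    """Sample the next state and the observed label of the toy labelled MDP."""
    move = 1 if a == "right" else -1
    if rng.random() < SLIP:
        move = -move
    s2 = min(max(s + move, 0), N_STATES - 1)
    label = "goal" if s2 == N_STATES - 1 and rng.random() < P_GOAL_LABEL else ""
    return s2, label

def automaton_step(q, label):
    """Two-state automaton for 'eventually goal'; state 1 is accepting and absorbing."""
    return 1 if q == 1 or label == "goal" else 0

def q_learning(episodes=3000, horizon=40, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning over product states (s, q) that are created only when visited."""
    rng = random.Random(seed)
    Q = {}

    def qvals(x):
        return Q.setdefault(x, {a: 0.0 for a in ACTIONS})

    for _ in range(episodes):
        s, q = 0, 0
        for _ in range(horizon):
            x = (s, q)
            if rng.random() < eps:
                a = rng.choice(ACTIONS)               # explore
            else:
                a = max(ACTIONS, key=qvals(x).get)    # exploit
            s2, label = mdp_step(s, a, rng)           # environment move
            q2 = automaton_step(q, label)             # synchronised automaton move
            r = 1.0 if q2 == 1 else 0.0               # reward tied to acceptance
            target = r + gamma * max(qvals((s2, q2)).values())
            qvals(x)[a] += alpha * (target - qvals(x)[a])
            s, q = s2, q2
    return Q

def greedy_policy(Q):
    """Greedy action for every product state that was visited during learning."""
    return {x: max(ACTIONS, key=Q[x].get) for x in Q}
```

Calling `greedy_policy(q_learning())` returns a dictionary keyed by product states `(s, q)`. The learner never builds the transition graph or the full product explicitly; it only samples `mdp_step` and advances the automaton alongside it, which is the sense in which the product is formed on the fly in this sketch.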

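This study proposes a reinforcement-learning-based algorithm for synthesizing control policies that maximize the probability of satisfying high-level control objectives given as linear temporal logic formulas. The algorithm translates the LTL specification into a limit-deterministic Büchi automaton, which is then combined for training with a PL-MDP whose workspace properties, structure, and agent actions are uncertain, thereby maximizing the satisfaction probability.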

Reinforcement Learning for Temporal Logic Control Synthesis with Probabilistic Satisfaction Guarantees

This work explores the trade-off between the number of samples required to
accurately build models of dynamical systems and the degradation of performance
in various control objectives due to a coarse approximation. In particular, we
show that simple models can be easily fit from input/output data and are
sufficient for achieving various control objectives. We derive bounds on the
number of noisy input/output samples from a stable linear time-invariant system
that are sufficient to guarantee that the corresponding finite impulse response
approximation is close to the true system in the $\mathcal{H}_\infty$-norm. We
demonstrate that these demands are lower than those derived in prior art which
aimed to accurately identify dynamical models. We also explore how different
physical input constraints, such as power constraints, affect the sample
complexity. Finally, we show how our analysis fits within the established
framework of robust control, by demonstrating how a controller designed for an
approximate system provably meets performance objectives on the true system.
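
To make the quantities above concrete, the following sketch fits a finite impulse response (FIR) model from noisy data and measures its distance to the true system on a frequency grid. It is a simplified illustration under stated assumptions, not the estimator or the bounds analysed in the paper: the first-order system `y[t] = a*y[t-1] + u[t]`, the use of repeated unit-impulse experiments with averaging, the Gaussian output noise, and the gridded approximation of the $\mathcal{H}_\infty$-norm are all choices made here.

```python
import cmath
import math
import random

def impulse_response(a, n):
    """First n impulse-response coefficients of the assumed system y[t] = a*y[t-1] + u[t]."""
    return [a ** k for k in range(n)]

def estimate_fir(a, length, trials, noise_std, rng):
    """Average `trials` noisy impulse-response experiments into an FIR estimate."""
    g = impulse_response(a, length)
    sums = [0.0] * length
    for _ in range(trials):
        for k in range(length):
            sums[k] += g[k] + rng.gauss(0.0, noise_std)
    return [s / trials for s in sums]

def hinf_norm_fir(coeffs, grid=256):
    """Approximate the H-infinity norm of an FIR filter by sampling frequencies in [0, pi]."""
    best = 0.0
    for i in range(grid):
        w = math.pi * i / (grid - 1)
        val = abs(sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(coeffs)))
        best = max(best, val)
    return best

def hinf_error(a, fir, tail=200, grid=256):
    """Gridded H-infinity distance between the FIR estimate and the true system,
    with the true impulse response truncated after `tail` coefficients."""
    g = impulse_response(a, tail)
    diff = [g[k] - (fir[k] if k < len(fir) else 0.0) for k in range(tail)]
    return hinf_norm_fir(diff, grid)
```

The error returned by `hinf_error` has two sources: the truncation of the impulse response to `length` coefficients, and the measurement noise, which shrinks as `trials` grows. With no noise and `a = 0.5`, a length-8 model leaves only the truncation term, which equals `a**8 / (1 - a)` for this system.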

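This work examines the trade-off between the number of samples needed to accurately model a dynamical system and the loss of control performance caused by a coarse approximation. It gives upper bounds on the number of noisy input/output samples required from a stable linear time-invariant system, shows that these requirements are lower than those of prior work aimed at accurately identifying dynamical models, and explains how different physical input constraints affect the sample complexity. Finally, it shows how the analysis fits within the established framework of robust control, proving that a controller designed for the approximate system meets performance objectives on the true system.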