Surrogate rewards are commonly used in planning problems with linear temporal
logic (LTL) objectives. In a widely adopted surrogate
reward approach, two discount factors are used to ensure that the expected
return approximates the satisfaction probability of the LTL objective. The
expected return can then be estimated by methods that use Bellman updates, such
as reinforcement learning. However, the uniqueness of the solution to the
Bellman equation with two discount factors has not been explicitly discussed.
We demonstrate with an example that when one of the discount factors is set to
one, as allowed in many previous works, the Bellman equation may have multiple
solutions, leading to inaccurate evaluation of the expected return. We then
propose a condition for the Bellman equation to have the expected return as the
unique solution, requiring the solutions for states inside a rejecting bottom
strongly connected component (BSCC) to be 0. We prove this condition is
sufficient by showing that the solutions for the states with discounting can be
separated from those for the states without discounting under this condition.
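
The non-uniqueness described above can be reproduced on a toy chain. The sketch below is illustrative only: the three-state chain, the value of `gamma_b`, and the reward form (1 - `gamma_b` in accepting states, 0 elsewhere) are assumptions made for this example and are not taken from the abstract. With the second discount factor set to one, the value of the rejecting sink state is unconstrained by the Bellman equation, so Bellman updates started from different initial guesses settle on different fixed points. Initializing the rejecting state at 0 recovers the satisfaction probability of 0.5.

```python
# Hypothetical 3-state Markov chain (not from the source):
#   state 0: initial, non-accepting, moves to state 1 or 2 with probability 0.5
#   state 1: accepting sink (self-loop)
#   state 2: rejecting sink (self-loop), i.e. a rejecting BSCC
P = [
    [0.0, 0.5, 0.5],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
accepting = [False, True, False]

gamma_b = 0.9  # discount factor applied in accepting states (assumed value)
gamma = 1.0    # discount factor applied elsewhere, set to one on purpose

# Assumed surrogate reward: 1 - gamma_b in accepting states, 0 elsewhere.
reward = [1.0 - gamma_b if a else 0.0 for a in accepting]
discount = [gamma_b if a else gamma for a in accepting]


def bellman_backup(v):
    """Apply one synchronous Bellman update to every state."""
    return [
        reward[s] + discount[s] * sum(p * v[t] for t, p in enumerate(P[s]))
        for s in range(len(v))
    ]


def bellman_residual(v):
    """Largest violation of the Bellman equation over all states."""
    return max(abs(new - old) for new, old in zip(bellman_backup(v), v))


def iterate(v0, sweeps=500):
    """Repeat the Bellman update starting from the initial guess v0."""
    v = list(v0)
    for _ in range(sweeps):
        v = bellman_backup(v)
    return v


# Rejecting state initialized at 1: the update never corrects it.
v_free = iterate([0.0, 0.0, 1.0])
# Rejecting state initialized at 0: matches the condition in the text.
v_pinned = iterate([0.0, 0.0, 0.0])

print("initial-state value, rejecting state free:  ", round(v_free[0], 6))
print("initial-state value, rejecting state pinned:", round(v_pinned[0], 6))
print("residuals:", bellman_residual(v_free), bellman_residual(v_pinned))
```

Both vectors satisfy the Bellman equation up to floating-point error, yet only the one with the rejecting state held at 0 equals the expected return.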

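In planning problems that handle linear temporal logic objectives through the Bellman equation, we find that the surrogate reward approach with two discount factors can approximate the satisfaction probability of the temporal logic objective, but when one discount factor is set to 1, the Bellman equation may have multiple solutions, which makes the evaluation of the expected return inaccurate. We propose a condition under which the Bellman equation has the expected return as its unique solution, requiring the solutions for states inside a rejecting bottom strongly connected component to be 0, and we prove that this condition suffices to separate the solutions for the discounted states from those for the undiscounted states.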

On the Uniqueness of the Solution to the Bellman Equation for LTL Objectives

On the Uniqueness of Solution for the Bellman Equation of LTL Objectives

Deep Reinforcement Learning has been shown to be very successful in complex
games, e.g. Atari or Go. These games have clearly defined rules, and hence
allow simulation. In many practical applications, however, interactions with
the environment are costly and a good simulator of the environment is not
available. Further, as environments differ by application, the optimal
inductive bias (architecture, hyperparameters, etc.) of a reinforcement learning
agent depends on the application. In this work, we propose a multi-armed bandit
framework that selects from a set of different reinforcement learning agents to
choose the one with the best inductive bias. To alleviate the problem of sparse
rewards, the reinforcement learning agents are augmented with surrogate
rewards. This helps the bandit framework to select the best agents early, since
these rewards are smoother and less sparse than the environment reward. The
bandit has the double objective of maximizing the reward while the agents are
learning and selecting the best agent after a finite number of learning steps.
Our experimental results on standard environments show that the proposed
framework is able to consistently select the optimal agent after a finite
number of steps, while collecting more cumulative reward compared to selecting
a sub-optimal architecture or uniformly alternating between different agents.
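
The selection loop can be sketched as a bandit whose arms are the candidate agents. This is a minimal sketch under stated assumptions: the abstract does not say which bandit rule is used, so UCB1 is swapped in as a stand-in, and the agents are replaced by hypothetical stubs that return a Bernoulli surrogate reward with a fixed mean. Real agents improve while they learn, so their reward distributions would drift, which this stationary toy ignores.

```python
import math
import random

rng = random.Random(0)

# Hypothetical stand-ins for RL agents: each agent is reduced to the mean of
# the Bernoulli surrogate reward it earns per step (values are made up).
agent_mean_reward = [0.2, 0.5, 0.8]


def step_agent(agent_index):
    """Let one agent act for a step and return its simulated surrogate reward."""
    return 1.0 if rng.random() < agent_mean_reward[agent_index] else 0.0


def run_ucb1(num_agents, rounds):
    """Choose an agent each round with UCB1; return pull counts and reward sums."""
    counts = [0] * num_agents
    totals = [0.0] * num_agents
    for t in range(1, rounds + 1):
        if t <= num_agents:
            arm = t - 1  # try every agent once before using the index
        else:
            scores = [
                totals[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
                for a in range(num_agents)
            ]
            arm = max(range(num_agents), key=lambda a: scores[a])
        totals[arm] += step_agent(arm)
        counts[arm] += 1
    return counts, totals


counts, totals = run_ucb1(len(agent_mean_reward), rounds=5000)
best_agent = max(range(len(counts)), key=lambda a: counts[a])

print("steps given to each agent:", counts)
print("selected agent:", best_agent)
```

The pull counts reflect the double objective stated above: most steps go to the agent that currently looks best, which keeps cumulative reward high, and the most-pulled agent is the one returned as the final selection.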

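This paper proposes a deep reinforcement learning method based on a multi-armed bandit framework. By selecting the learning model best suited to a given application and augmenting the reinforcement learning agents, it addresses problems that arise in practical applications, such as poorly specified environments and unstable rewards. Experimental results show that the method selects the optimal agent on standard environments and collects a higher cumulative reward within the same number of steps than other strategies.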

A Bandit Framework for Selecting Reinforcement Learning Agents

A Bandit Framework for Optimal Selection of Reinforcement Learning Agents

Recent studies have shown that reinforcement learning (RL) models are
vulnerable in various noisy scenarios. For instance, the observed reward
channel is often subject to noise in practice (e.g., when rewards are collected
through sensors), and is therefore not credible. In addition, for applications
such as robotics, a deep reinforcement learning (DRL) algorithm can be
manipulated to produce arbitrary errors by receiving corrupted rewards. In this
paper, we consider noisy RL problems with perturbed rewards, which can be
approximated with a confusion matrix. We develop a robust RL framework that
enables agents to learn in noisy environments where only perturbed rewards are
observed. Our solution framework builds on existing RL/DRL algorithms and
is the first to address the biased noisy reward setting without any assumptions
on the true distribution (e.g., the zero-mean Gaussian noise assumed in previous
works). The core ideas of our solution include estimating a reward confusion
matrix and defining a set of unbiased surrogate rewards. We prove the
convergence and sample complexity of our approach. Extensive experiments on
different DRL platforms show that trained policies based on our estimated
surrogate reward can achieve higher expected rewards, and converge faster than
existing baselines. For instance, the state-of-the-art PPO algorithm obtains
84.6% and 80.8% improvements in average score on five Atari games at
error rates of 10% and 30%, respectively.
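
One way to read "unbiased surrogate rewards" is sketched below for two reward levels. The abstract does not give the construction, so this is an assumption: the surrogate value attached to each observed reward is chosen so that its expectation under the confusion matrix equals the true reward, which amounts to solving a small linear system. The reward levels, the flip rates, and the function names are hypothetical, and the confusion matrix is taken as known here, whereas the abstract describes estimating it.

```python
# Hypothetical reward levels and confusion matrix (values are made up).
# confusion[i][j] = probability of observing level j when the true level is i.
levels = [-1.0, 1.0]
confusion = [
    [0.9, 0.1],  # a true -1 is observed as +1 with probability 0.1
    [0.3, 0.7],  # a true +1 is observed as -1 with probability 0.3
]


def surrogate_rewards(confusion, levels):
    """Solve confusion @ r_hat = levels in the 2x2 case via the closed-form inverse."""
    (a, b), (c, d) = confusion
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("confusion matrix is singular; no unbiased surrogate exists")
    r0, r1 = levels
    return [(d * r0 - b * r1) / det, (a * r1 - c * r0) / det]


def expected_surrogate(confusion, r_hat, true_index):
    """Expected surrogate reward when the true reward level is true_index."""
    return sum(p * r for p, r in zip(confusion[true_index], r_hat))


r_hat = surrogate_rewards(confusion, levels)
expectations = [expected_surrogate(confusion, r_hat, i) for i in range(len(levels))]

print("surrogate value per observed level:", r_hat)
print("expected surrogate per true level: ", expectations)
```

The agent would be trained on `r_hat[j]` whenever level `j` is observed. The surrogate values fall outside the original reward range, which is the price paid for removing the bias, so the variance of the training signal grows as the confusion matrix approaches singularity.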

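This study proposes a robust reinforcement learning framework for learning in noisy environments and uses surrogate rewards to train optimized policies. Experiments show that our method outperforms existing baseline algorithms in raising expected reward and speeding up convergence.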