Reinforcement learning for control over continuous spaces typically uses
high-entropy stochastic policies, such as Gaussian distributions, both for
local exploration and for estimating policy gradients to optimize performance.
Many robotic control problems involve complex, unstable dynamics, where applying
actions that lie off the feasible control manifold can quickly lead to
undesirable divergence. In such cases, most samples taken from the ambient
action space generate low-value trajectories that hardly contribute to policy
improvement, resulting in slow or failed learning. We propose to improve action
selection in this model-free RL setting by introducing additional adaptive
control steps based on Extremum-Seeking Control (ESC). On each action sampled
from stochastic policies, we apply sinusoidal perturbations and query for
estimated Q-values as the response signal. Based on ESC, we then dynamically
improve the sampled actions to be closer to nearby optima before applying them
to the environment. Our methods can be easily added to standard policy
optimization to improve learning efficiency, which we demonstrate in various
control learning environments.
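
The perturb, query, and update loop described above can be sketched as follows. This is a minimal one-dimensional illustration of a classic extremum-seeking loop, not the authors' implementation: the function `esc_refine`, the toy critic `toy_q`, and every parameter value are assumptions chosen for the example.

```python
import math


def esc_refine(q_fn, action, steps=400, amp=0.2, omega=5.0,
               gain=2.0, dt=0.05, smooth=0.95):
    """Move a scalar action toward a nearby maximum of q_fn via ESC.

    All names and default values are illustrative assumptions.
    """
    a_hat = action          # current estimate of the improved action
    q_avg = q_fn(a_hat)     # running mean of Q, acts as a high-pass filter
    for k in range(steps):
        dither = math.sin(omega * k * dt)
        q = q_fn(a_hat + amp * dither)          # query Q at the perturbed action
        q_avg = smooth * q_avg + (1.0 - smooth) * q
        grad_signal = (q - q_avg) * dither      # demodulate: tracks the sign of dQ/da
        a_hat += gain * grad_signal * dt        # integrate toward the optimum
    return a_hat


def toy_q(a):
    """Hypothetical critic for one fixed state, peaked at action = 1.0."""
    return -(a - 1.0) ** 2


sampled = 0.5                        # stands in for an action drawn from the policy
refined = esc_refine(toy_q, sampled)
print(sampled, refined)
```

In this sketch only the refined action would be sent to the environment. A learned critic would replace `toy_q`, and a multi-dimensional action would need one dither frequency per dimension.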

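By introducing adaptive control steps based on extremum-seeking control, this work improves action selection in model-free reinforcement learning and raises learning efficiency in standard policy optimization.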

Extremum-Seeking Action Selection for Accelerating Policy Optimization

Most entropy measures depend on the spread of the probability distribution
over the sample space X, and the maximum achievable entropy grows with the
cardinality |X| of the sample space. For a finite |X|, this
yields robust entropy measures which satisfy many important properties, such as
invariance to bijections, while the same is not true for continuous spaces
(where |X| is infinite). Furthermore, since R and R^d (d in Z+) have the same
cardinality (by Cantor's correspondence argument), cardinality-dependent
entropy measures cannot encode the data dimensionality. In this work, we
question the role of cardinality and distribution spread in defining entropy
measures for continuous spaces, which can undergo multiple rounds of
transformations and distortions, e.g., in neural networks. We find that the
average value of the local intrinsic dimension of a distribution, denoted as
ID-Entropy, can serve as a robust entropy measure for continuous spaces, while
capturing the data dimensionality. ID-Entropy satisfies many
desirable properties and can be extended to conditional entropy, joint entropy,
and mutual-information variants. It also yields new information
bottleneck principles and links to causality. In the context of deep
learning, for feedforward architectures, we show, theoretically and
empirically, that the ID-Entropy of a hidden layer directly controls the
generalization gap for both classifiers and auto-encoders, when the target
function is Lipschitz continuous. Our work primarily shows that, for continuous
spaces, taking a structural rather than a statistical approach yields entropy
measures which preserve intrinsic data dimensionality, while being relevant for
studying various architectures.
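
To make "average local intrinsic dimension" concrete, here is a small sketch. The abstract does not say which local estimator the authors use, so this example swaps in the Levina-Bickel k-nearest-neighbour maximum-likelihood estimator. The function names, the choice of `k`, and the toy data sets are all assumptions.

```python
import math
import random


def local_id(points, i, k):
    """Levina-Bickel MLE of the intrinsic dimension around points[i]."""
    dists = sorted(math.dist(points[i], p)
                   for j, p in enumerate(points) if j != i)
    t_k = dists[k - 1]                       # distance to the k-th neighbour
    log_ratios = [math.log(t_k / t_j) for t_j in dists[:k - 1]]
    return (k - 1) / sum(log_ratios)


def mean_local_id(points, k=10):
    """Average the local estimates over the whole sample."""
    return sum(local_id(points, i, k) for i in range(len(points))) / len(points)


rng = random.Random(0)
# A 1-D segment and a 2-D patch, both embedded in 3-D ambient space.
line = [(t, 2.0 * t, -t) for t in (rng.random() for _ in range(400))]
plane = [(u, v, u + v)
         for u, v in ((rng.random(), rng.random()) for _ in range(400))]

line_id = mean_local_id(line)
plane_id = mean_local_id(plane)
print(line_id, plane_id)
```

Both point sets live in R^3, yet the averaged estimate should land near 1 for the segment and near 2 for the patch, which is the dimensional information a cardinality-based measure cannot provide.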

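This paper explores an approach based on the dimensionality and structure of the data rather than on statistics, and proposes an entropy measure for continuous spaces called ID-Entropy. The measure is suited to broad use in neural networks, preserves the intrinsic dimensionality of the data, and directly controls the generalization gap in classifiers and auto-encoders.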

Local Intrinsic Dimensional Entropy

We consider differentially private algorithms for reinforcement learning in
continuous spaces, such that neighboring reward functions are
indistinguishable. This protects the reward information from being exploited by
methods such as inverse reinforcement learning. Existing studies that guarantee
differential privacy do not extend to infinite state spaces, because the noise
level needed to ensure privacy would grow without bound. Our aim is to
protect the value function approximator regardless of the number of states
at which it is queried. We achieve this by iteratively adding functional noise
to the value function during training. We show rigorous privacy guarantees by a
series of analyses on the kernel of the noise space, the probabilistic bound of
such noise samples, and the composition over the iterations. We gain insight
into the utility analysis by proving the algorithm's approximate optimality
when the state space is discrete. Experiments corroborate our theoretical
findings and show improvement over existing approaches.
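
The idea of perturbing the value function with a single random function, rather than with independent noise per queried state, can be illustrated as below. The abstract does not specify the noise process, so this sketch assumes a smooth random function drawn with random Fourier features, which approximates a Gaussian-process sample with an RBF kernel. The names `sample_noise_function` and `base_value` are hypothetical, and `sigma` is not calibrated to any privacy budget.

```python
import math
import random


def sample_noise_function(sigma=1.0, length_scale=0.5, n_features=200, seed=0):
    """Draw one smooth random function g(s) of a scalar state s.

    Random Fourier features approximate a Gaussian-process sample with an
    RBF kernel. sigma is NOT calibrated to a privacy budget in this sketch.
    """
    rng = random.Random(seed)
    freqs = [rng.gauss(0.0, 1.0 / length_scale) for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    weights = [rng.gauss(0.0, sigma) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def g(s):
        return scale * sum(w * math.cos(f * s + p)
                           for w, f, p in zip(weights, freqs, phases))

    return g


def base_value(s):
    """Hypothetical value estimate for a scalar state."""
    return -abs(s)


noise = sample_noise_function()      # drawn once per noise-adding iteration


def noisy_value(s):
    """Value function released after the functional perturbation."""
    return base_value(s) + noise(s)


print(noisy_value(0.3), noisy_value(0.3001))
```

Because `noise` is a fixed function once drawn, querying more states reveals more of the same sample instead of requiring fresh noise, which is the property that lets the noise level stay independent of the number of queries.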

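By iteratively adding functional noise to the value function during training, this paper protects the value function approximator of a differentially private reinforcement learning algorithm in continuous spaces, and proves its privacy guarantee and approximate optimality.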

Privacy-preserving Q-Learning with Functional Noise in Continuous State Spaces

Goal recognition is the problem of inferring the goal of an agent, based on
its observed actions. An inspiring approach, plan recognition by planning
(PRP), uses off-the-shelf planners to dynamically generate plans for given
goals, eliminating the need for the traditional plan library. However, the
existing PRP formulation is inherently inefficient in online recognition and
cannot be used with motion planners for continuous spaces. In this paper, we use a
different PRP formulation which allows for online goal recognition, and for
application in continuous spaces. We present an online recognition algorithm
in which two heuristic decision points can be used to significantly improve
run-time over existing work. We specify heuristics for continuous domains,
prove guarantees on their use, and empirically evaluate the algorithm over
hundreds of experiments in both a 3D navigational environment and a cooperative
robotic team task.
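
As a rough illustration of plan recognition by planning in a continuous space, the sketch below ranks candidate goals by comparing the cost of an optimal plan with the cost of a plan that passes through the observations. This is one common way to instantiate the idea and is not necessarily the formulation used in this paper. Straight-line distance stands in for a real motion planner, and `rank_goals`, `path_cost`, and the example coordinates are assumptions.

```python
import math


def path_cost(points):
    """Length of the piecewise-linear path through the given points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def rank_goals(start, observations, goals):
    """Return a normalized score per goal; higher means a better match."""
    scores = {}
    for name, goal in goals.items():
        optimal = math.dist(start, goal)          # stand-in for a planner call
        via_obs = path_cost([start, *observations, goal])
        scores[name] = optimal / via_obs if via_obs > 0 else 1.0
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


start = (0.0, 0.0, 0.0)
observations = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]    # agent moving along x
goals = {"A": (5.0, 0.0, 0.0), "B": (0.0, 5.0, 0.0)}

ranking = rank_goals(start, observations, goals)
print(ranking)
```

Each call to `math.dist` marks a place where an online recognizer would invoke a planner, so heuristics that skip or reuse those calls are where run-time savings would come from.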

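This paper proposes a planning-based plan recognition method that supports online goal recognition and applies to continuous spaces, using two heuristic decision points and heuristics for continuous environments to improve run-time efficiency.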