This paper focuses on the Passable Obstacles Aware (POA) planner, a novel
navigation method for two-wheeled robots in highly cluttered environments. The
navigation algorithm detects and classifies objects to distinguish two types of
obstacles: passable and unpassable. Our algorithm allows two-wheeled robots to
find a path through passable obstacles. Such a solution helps the robot work
in areas inaccessible to standard path planners and find optimal trajectories
in scenarios with a high number of objects in the robot's vicinity. The POA
planner can be embedded into other planning algorithms and enables them to
build a path through obstacles. Our method decreases path length and the total
travel time to the final destination by up to 43% and 39%, respectively, compared
to standard path planners such as GVD, A*, and RRT*.
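
To make the idea of embedding passable-obstacle awareness into an existing planner concrete, here is a minimal sketch in Python. It is not the authors' implementation: it assumes a 4-connected occupancy grid, assumes that passing through a passable obstacle can be modelled as a fixed extra traversal cost, and uses plain A* as the host planner. The names `FREE`, `PASSABLE`, `BLOCKED`, `plan`, `passable_cost`, and `allow_passable` are hypothetical.

```python
import heapq

FREE, PASSABLE, BLOCKED = 0, 1, 2  # hypothetical cell labels from a classifier


def plan(grid, start, goal, passable_cost=1.5, allow_passable=True):
    """A* on a 4-connected grid. Returns (cost, path) or None if unreachable.

    Assumption: entering a passable obstacle costs `passable_cost` (>= 1.0,
    so the Manhattan heuristic stays admissible); free cells cost 1.0.
    With allow_passable=False every obstacle is treated as blocked, which
    mimics a planner that is unaware of passable obstacles.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0.0}   # cheapest known cost to reach each cell
    parent = {}           # back-pointers for path reconstruction
    frontier = [(heuristic(start), start)]
    while frontier:
        _, cell = heapq.heappop(frontier)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return best[goal], path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            kind = grid[nxt[0]][nxt[1]]
            if kind == BLOCKED or (kind == PASSABLE and not allow_passable):
                continue
            step = passable_cost if kind == PASSABLE else 1.0
            new_cost = best[cell] + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(frontier, (new_cost + heuristic(nxt), nxt))
    return None


# A passable obstacle sits on the straight line between start and goal.
grid = [
    [FREE, FREE, FREE],
    [FREE, PASSABLE, FREE],
]
aware = plan(grid, (1, 0), (1, 2))                          # goes through the obstacle
unaware = plan(grid, (1, 0), (1, 2), allow_passable=False)  # detours around it
```

In this toy grid the aware planner returns a three-cell path through the obstacle, while the unaware one needs five cells. How the real POA planner weighs a passable obstacle against a detour is not stated in the abstract.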

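This paper introduces a novel navigation method, the Passable Obstacles Aware (POA) planner, which helps two-wheeled robots find passages through highly cluttered environments, reduces path length and total travel time, and can be embedded into other planning algorithms.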

A passable-obstacle-aware path-planning algorithm for navigating a two-wheeled robot in highly cluttered environments

POA: Passable Obstacles Aware Path-planning Algorithm for Navigation of a Two-wheeled Robot in Highly Cluttered Environments

Substantial advancements to model-based reinforcement learning algorithms
have been impeded by the model-bias induced by the collected data, which
generally hurts performance. Meanwhile, their inherent sample efficiency
warrants utility for most robot applications, limiting potential damage to the
robot and its environment during training. Inspired by information theoretic
model predictive control and advances in deep reinforcement learning, we
introduce Model Predictive Actor-Critic (MoPAC), a hybrid
model-based/model-free method that combines model predictive rollouts with
policy optimization so as to mitigate model bias. MoPAC leverages optimal
trajectories to guide policy learning, but explores via its model-free method,
allowing the algorithm to learn more expressive dynamics models. This
combination guarantees optimal skill learning up to an approximation error and
reduces necessary physical interaction with the environment, making it suitable
for real-robot training. We provide extensive results showcasing how our
proposed method generally outperforms the current state of the art and conclude by
evaluating MoPAC for learning on a physical robotic hand performing valve
rotation and finger gaiting, a task that requires grasping, manipulation, and
then regrasping of an object.
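
The abstract names information-theoretic model predictive control as one inspiration. As a rough illustration of that ingredient only, and not of MoPAC itself, the sketch below performs one sampling-based update of an action plan: perturbed plans are rolled out through a model, and the perturbations are averaged with weights that decay exponentially in rollout cost. The toy model `x' = x + a`, the quadratic cost, and the names `rollout_cost`, `mppi_update`, `sigma`, and `temperature` are all assumptions made for this example.

```python
import math
import random


def rollout_cost(x0, actions, target=1.0):
    """Sum of squared distances to `target` under the toy model x' = x + a."""
    x, cost = x0, 0.0
    for a in actions:
        x += a
        cost += (x - target) ** 2
    return cost


def mppi_update(x0, nominal, rng, num_samples=512, sigma=0.5, temperature=0.1):
    """One sampling-based MPC update. Returns (new_plan, weights)."""
    # Sample Gaussian perturbations of the nominal action sequence.
    noises = [[rng.gauss(0.0, sigma) for _ in nominal] for _ in range(num_samples)]
    costs = [
        rollout_cost(x0, [u + e for u, e in zip(nominal, eps)]) for eps in noises
    ]
    # Exponential weights; subtracting the minimum cost avoids underflow.
    lowest = min(costs)
    raw = [math.exp(-(c - lowest) / temperature) for c in costs]
    total = sum(raw)
    weights = [w / total for w in raw]
    # The new plan is the nominal plan plus the weighted average perturbation.
    new_plan = [
        u + sum(w * eps[t] for w, eps in zip(weights, noises))
        for t, u in enumerate(nominal)
    ]
    return new_plan, weights


rng = random.Random(0)          # fixed seed so the run is repeatable
nominal = [0.0] * 5             # start from a do-nothing plan
improved, weights = mppi_update(0.0, nominal, rng)
```

In a hybrid scheme of the kind the abstract describes, trajectories refined this way would guide policy learning while a model-free learner handles exploration. The details of that coupling are not given here and are not reproduced by this sketch.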

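This paper introduces MoPAC, a hybrid model-based/model-free learning method built on model predictive control that mitigates model bias through exploration and exploitation and makes training on real robots feasible. The method uses optimal trajectories to guide policy learning and explores when needed. In experiments, MoPAC outperforms current state-of-the-art methods and is suitable for real-robot training, while offering a way to learn optimal skills for complex tasks such as grasping, manipulating, and regrasping objects.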

Model predictive actor-critic under deep reinforcement learning: accelerating robot skill acquisition

Model Predictive Actor-Critic: Accelerating Robot Skill Acquisition with Deep Reinforcement Learning

The concept of the value-gradient is introduced and developed in the context
of reinforcement learning. It is shown that by learning the value-gradients,
exploration or stochastic behaviour is no longer needed to find locally optimal
trajectories. This is the main motivation for using value-gradients, and it is
argued that learning value-gradients is the actual objective of any
value-function learning algorithm for control problems. It is also argued that
learning value-gradients is significantly more efficient than learning just the
values, and this argument is supported in experiments by efficiency gains of
several orders of magnitude, in several problem domains. Once value-gradients
are introduced into learning, several analyses become possible. For example, a
surprising equivalence between a value-gradient learning algorithm and a
policy-gradient learning algorithm is proven, and this provides a robust
convergence proof for control problems using a value function with a general
function approximator.
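
The abstract does not spell out a definition, so the sketch below assumes the usual reading: the value-gradient is the derivative of the value function with respect to the state. For a deterministic model, policy, and reward, it can be computed along a trajectory by a backward chain-rule recursion. This is a worked toy example of that quantity, not the learning algorithm from the paper; the linear model, quadratic reward, and all function names are hypothetical.

```python
def policy(x, k=0.5):
    return -k * x                  # linear feedback policy (assumed)


def step(x, a):
    return x + a                   # toy deterministic model (assumed)


def reward(x, a):
    return -(x * x + a * a)        # quadratic penalty on state and action


def value(x, horizon, k=0.5):
    """Undiscounted return of following `policy` for `horizon` steps."""
    total = 0.0
    for _ in range(horizon):
        a = policy(x, k)
        total += reward(x, a)
        x = step(x, a)
    return total


def value_gradient(x, horizon, k=0.5):
    """dV/dx at the start state, via a backward pass along the trajectory."""
    states = []
    for _ in range(horizon):       # forward pass: record visited states
        states.append(x)
        x = step(x, policy(x, k))
    grad = 0.0                     # value beyond the horizon is zero
    for x_t in reversed(states):   # backward pass: chain rule per step
        a_t = policy(x_t, k)
        dr_dx, dr_da = -2.0 * x_t, -2.0 * a_t
        dpi_dx = -k
        df_dx, df_da = 1.0, 1.0
        grad = dr_dx + dpi_dx * dr_da + (df_dx + dpi_dx * df_da) * grad
    return grad


analytic = value_gradient(2.0, horizon=3)
h = 1e-4                           # central finite difference as a cross-check
numeric = (value(2.0 + h, 3) - value(2.0 - h, 3)) / (2 * h)
```

Here `analytic` and `numeric` agree up to rounding, which confirms that the recursion computes the slope of the value function at the start state. A value-gradient learner would train a function approximator to predict this quantity rather than the value itself.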

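This study introduces and develops the concept of the value-gradient in reinforcement learning. It shows that learning value-gradients is markedly more efficient for control problems than learning values alone, and proves a surprising equivalence between a value-gradient learning algorithm and a policy-gradient learning algorithm. Experiments in several problem domains show that using value-gradients improves efficiency by several orders of magnitude, so that exploration or stochastic behaviour is no longer needed to find locally optimal trajectories.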