Purpose: Autonomous navigation of catheters and guidewires can enhance
endovascular surgery safety and efficacy, reducing procedure times and operator
radiation exposure. Integrating tele-operated robotics could widen access to
time-sensitive emergency procedures like mechanical thrombectomy (MT).
Reinforcement learning (RL) shows potential in endovascular navigation, yet its
application encounters challenges without a reward signal. This study explores
the viability of autonomous navigation in MT vasculature using inverse RL (IRL)
to leverage expert demonstrations. Methods: This study established a
simulation-based training and evaluation environment for MT navigation. We used
IRL to infer reward functions from expert behaviour when navigating a guidewire
and catheter. We utilized soft actor-critic to train models with various reward
functions and compared their performance in silico. Results: We demonstrated
feasibility of navigation using IRL. When evaluating single- versus dual-device
(i.e. guidewire versus catheter and guidewire) tracking, both methods achieved
high success rates of 95% and 96%, respectively. Dual-device tracking, however,
utilized both devices, mimicking an expert. A success rate of 100% and procedure
time of 22.6 s were obtained when training with a reward function obtained
through reward shaping. This outperformed a dense reward function (96%, 24.9 s)
and an IRL-derived reward function (48%, 59.2 s). Conclusions: We have
contributed to the advancement of autonomous endovascular intervention
navigation, particularly MT, by employing IRL. The results underscore the
potential of using reward shaping to train models, offering a promising avenue
for enhancing the accessibility and precision of MT. We envisage that future
research can extend our methodology to diverse anatomical structures to enhance
generalizability.
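
The abstract does not say how the shaped reward was constructed. As a minimal sketch of one common form, potential-based shaping, the snippet below adds the discounted change in a potential to a base reward. The potential (negative distance from the device tip to the target), the function names, and the discount value are illustrative assumptions, not the study's implementation.

```python
def potential(distance_to_target):
    """Hypothetical potential: higher (less negative) when closer to the target."""
    return -distance_to_target


def shaped_reward(base_reward, phi_state, phi_next_state, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s)."""
    return base_reward + gamma * phi_next_state - phi_state


# Example: the tip moves from 10 mm to 8 mm away from the target
# while the base (sparse) reward is still zero.
step_reward = shaped_reward(0.0, potential(10.0), potential(8.0), gamma=1.0)
```

With `gamma=1.0` the shaping terms telescope along a trajectory, so their sum depends only on the start and end potentials. This is why this form of shaping gives denser feedback without rewarding detours.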

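Using inverse reinforcement learning (IRL), this study explores the feasibility of autonomous navigation in mechanical thrombectomy (MT) vasculature by inferring reward functions from expert demonstrations and training models with soft actor-critic; the results indicate that training models with reward shaping can improve the accessibility and precision of MT.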

Autonomous navigation of catheters and guidewires in mechanical thrombectomy using inverse reinforcement learning

Applying reinforcement learning (RL) to sparse reward domains is notoriously
challenging due to insufficient guiding signals. Common RL techniques for
addressing such domains include (1) learning from demonstrations and (2)
curriculum learning. While these two approaches have been studied in detail,
they have rarely been considered together. This paper aims to do so by
introducing a principled task phasing approach that uses demonstrations to
automatically generate a curriculum sequence. Using inverse RL from
(suboptimal) demonstrations we define a simple initial task. Our task phasing
approach then provides a framework to gradually increase the complexity of the
task all the way to the target task, while retuning the RL agent in each
phasing iteration. Two approaches for phasing are considered: (1) gradually
increasing the proportion of time steps an RL agent is in control, and (2)
phasing out a guiding informative reward function. We present conditions that
guarantee the convergence of these approaches to an optimal policy.
Experimental results on 3 sparse reward domains demonstrate that our task
phasing approaches outperform state-of-the-art approaches with respect to
asymptotic performance.
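
The two phasing ideas named in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the linear schedule, the per-step random choice between learner and guide, the additive reward mix, and all function names are hypothetical and are not taken from the paper.

```python
import random


def phasing_schedule(num_phases):
    """Assumed linear schedule from 0.0 (initial task) to 1.0 (target task)."""
    return [k / num_phases for k in range(num_phases + 1)]


def mixed_action(state, learner_policy, guide_policy, learner_share, rng):
    """Approach (1): the learner controls a growing share of time steps."""
    if rng.random() < learner_share:
        return learner_policy(state), "learner"
    return guide_policy(state), "guide"


def phased_reward(sparse_reward, guide_reward, guide_weight):
    """Approach (2): the guiding reward is weighted down towards zero."""
    return sparse_reward + guide_weight * guide_reward


# Example: walk through the phases with toy policies.
rng = random.Random(0)
learner = lambda s: "explore"
guide = lambda s: "follow_demo"
for share in phasing_schedule(4):
    action, controller = mixed_action(0, learner, guide, share, rng)
    reward = phased_reward(0.0, 0.5, guide_weight=1.0 - share)
```

In the final phase the learner is fully in control and only the sparse reward remains, which corresponds to the target task. Between phases, the RL agent would be retuned as the abstract describes.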

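This paper introduces a machine learning method based on task phasing that tackles reinforcement learning under sparse rewards by gradually increasing task complexity and adjusting the feedback signal, and it reports good results.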

Task Phasing: Automated Curriculum Learning from Demonstrations

Multi-task reinforcement learning (RL) aims to simultaneously learn policies
for solving many tasks. Several prior works have found that relabeling past
experience with different reward functions can improve sample efficiency.
Relabeling methods typically ask: if, in hindsight, we assume that our
experience was optimal for some task, for what task was it optimal? In this
paper, we show that hindsight relabeling is inverse RL, an observation that
suggests that we can use inverse RL in tandem with RL algorithms to efficiently
solve many tasks. We use this idea to generalize goal-relabeling techniques
from prior work to arbitrary classes of tasks. Our experiments confirm that
relabeling data using inverse RL accelerates learning in general multi-task
settings, including goal-reaching, domains with discrete sets of rewards, and
those with linear reward functions.
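
The relabeling question, "for what task was this experience optimal?", can be illustrated for a discrete set of candidate tasks. The sketch below scores a trajectory under each task's reward and turns the scores into a distribution with a softmax. The uniform prior over tasks, the optional per-task normalizer (zero by default), and all names are assumptions made for illustration, not the paper's algorithm.

```python
import math


def relabel_distribution(trajectory, reward_fns, log_normalizers=None):
    """Return a distribution over tasks for one trajectory of (state, action) pairs."""
    returns = [sum(r(s, a) for s, a in trajectory) for r in reward_fns]
    if log_normalizers is None:
        # Assumption: treat all tasks as equally easy (no per-task correction).
        log_normalizers = [0.0] * len(reward_fns)
    logits = [g - z for g, z in zip(returns, log_normalizers)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def goal_reward(goal):
    """Hypothetical goal-reaching reward: negative distance to the goal."""
    return lambda state, action: -abs(state - goal)


# Example: a 1-D trajectory that moves from 0 to 2.
trajectory = [(0, 1), (1, 1), (2, 0)]
probs = relabel_distribution(trajectory, [goal_reward(2), goal_reward(-2)])
```

Here the trajectory ends at position 2, so the task with goal 2 receives most of the probability mass, and the experience would most likely be relabeled with that task.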

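This paper introduces inverse reinforcement learning (inverse RL), uses it to implement goal-relabeling techniques, and confirms that inverse RL accelerates learning in multi-task settings, including goal-reaching, domains with discrete sets of rewards, and those with linear reward functions.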

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Deep reinforcement learning (RL) can acquire complex behaviors from low-level
inputs, such as images. However, real-world applications of such methods
require generalizing to the vast variability of the real world. Deep networks
are known to achieve remarkable generalization when provided with massive
amounts of labeled data, but can we provide this breadth of experience to an RL
agent, such as a robot? The robot might continuously learn as it explores the
world around it, even while deployed. However, this learning requires access to
a reward function, which is often hard to measure in real-world domains, where
the reward could depend on, for example, unknown positions of objects or the
emotional state of the user. Conversely, it is often quite practical to provide
the agent with reward functions in a limited set of situations, such as when a
human supervisor is present or in a controlled setting. Can we make use of this
limited supervision, and still benefit from the breadth of experience an agent
might collect on its own? In this paper, we formalize this problem as
semisupervised reinforcement learning, where the reward function can only be
evaluated in a set of "labeled" MDPs, and the agent must generalize its
behavior to the wide range of states it might encounter in a set of "unlabeled"
MDPs, by using experience from both settings. Our proposed method infers the
task objective in the unlabeled MDPs through an algorithm that resembles
inverse RL, using the agent's own prior experience in the labeled MDPs as a
kind of demonstration of optimal behavior. We evaluate our method on
challenging tasks that require control directly from images, and show that our
approach can improve the generalization of a learned deep neural network policy
by using experience for which no reward function is available. We also show
that our method outperforms direct supervised learning of the reward.

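This paper studies how, with limited labeled data, methods such as semi-supervised reinforcement learning and inverse reinforcement learning can help RL agents such as robots generalize better when exploring unknown domains, and it evaluates the method on image-based control tasks.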