Multi-task reinforcement learning endeavors to accomplish a set of different
tasks with a single policy. To enhance data efficiency by sharing parameters
across multiple tasks, a common practice segments the network into distinct
modules and trains a routing network to recombine these modules into
task-specific policies. However, existing routing approaches employ a fixed
number of modules for all tasks, neglecting that tasks of varying
difficulty commonly require varying amounts of knowledge. This work presents
a Dynamic Depth Routing (D2R) framework, which learns to strategically skip
certain intermediate modules, thereby flexibly choosing different numbers of
modules for each task. Under this framework, we further introduce a ResRouting
method to address the issue of disparate routing paths between behavior and
target policies during off-policy training. In addition, we design an automatic
route-balancing mechanism to encourage continued routing exploration for
unmastered tasks without disturbing the routing of mastered ones. We conduct
extensive experiments on various robotic manipulation tasks in the Meta-World
benchmark, where D2R achieves state-of-the-art performance with significantly
improved learning efficiency.

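This study tackles multi-task reinforcement learning with the Dynamic Depth Routing (D2R) framework, which bypasses intermediate modules so that each task can flexibly use a different number of modules, improving data efficiency. Within the framework, the ResRouting method addresses the disparate routing paths of the behavior and target policies during off-policy training, and an automatic route-balancing mechanism encourages continued routing exploration for unmastered tasks. Extensive experiments on various robotic manipulation tasks in the Meta-World benchmark show state-of-the-art performance with significantly improved learning efficiency.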
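
The module-skipping idea summarized above can be illustrated with a small sketch. It is not the authors' implementation: the scalar modules, the hand-written routing logits, the top-k source selection, and the names top_k_softmax and route are all assumptions made for illustration. In the actual method a routing network would learn the routing per task. The sketch only shows how per-task routing weights can make one task use a single module while another uses all of them.

```python
import math

def top_k_softmax(logits, k):
    """Keep the k largest logits and softmax over them (assumed selection rule)."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in keep)
    exps = {i: math.exp(logits[i] - m) for i in keep}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def route(x, modules, logits, k=1):
    """Run a routed forward pass and report which modules were used.

    Source 0 is the raw input; source j (j >= 1) is the output of module j-1.
    logits[i] scores sources 0..i for module i; logits[n] scores all sources
    for the output head. Modules that no selected path reaches are skipped.
    """
    n = len(modules)
    weights = [top_k_softmax(logits[i], k) for i in range(n + 1)]

    # Walk backward from the output head to find the modules actually needed.
    needed, stack = set(), [n]
    while stack:
        node = stack.pop()
        for src in weights[node]:
            if src > 0 and (src - 1) not in needed:
                needed.add(src - 1)
                stack.append(src - 1)

    # Forward pass that evaluates only the needed modules.
    sources = {0: x}
    for i in range(n):
        if i in needed:
            inp = sum(w * sources[s] for s, w in weights[i].items())
            sources[i + 1] = modules[i](inp)
    out = sum(w * sources[s] for s, w in weights[n].items())
    return out, sorted(needed)

# Three toy "modules" shared by all tasks (hypothetical stand-ins for network blocks).
modules = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 1]

# Hypothetical logits: the easy task reads the output head from module 0 only,
# the hard task chains all three modules.
easy_logits = [[0.0], [5.0, 0.0], [5.0, 0.0, 1.0], [0.0, 9.0, 1.0, 2.0]]
hard_logits = [[0.0], [0.0, 5.0], [0.0, 1.0, 5.0], [0.0, 1.0, 2.0, 5.0]]

easy_out, easy_used = route(2.0, modules, easy_logits)
hard_out, hard_used = route(2.0, modules, hard_logits)
print(easy_out, easy_used)
print(hard_out, hard_used)
```

With k=1 each node reads from exactly one source, so the routing reduces to a single path of variable depth. A larger k would mix several earlier outputs, which is closer to a soft recombination of modules.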