Offline-to-online reinforcement learning (RL), by combining the benefits of
offline pretraining and online finetuning, promises enhanced sample efficiency
and policy performance. However, existing methods, effective as they are,
suffer from suboptimal performance, limited adaptability, and unsatisfactory
computational efficiency. We propose a novel framework, PROTO, which overcomes
the aforementioned limitations by augmenting the standard RL objective with an
iteratively evolving regularization term. Performing a trust-region-style
update, PROTO yields stable initial finetuning and optimal final performance by
gradually evolving the regularization term to relax the constraint strength. By
adjusting only a few lines of code, PROTO can bridge any offline policy
pretraining and standard off-policy RL finetuning to form a powerful
offline-to-online RL pathway, making it highly adaptable to diverse methods.
Simple yet elegant, PROTO imposes minimal additional computation and enables
highly efficient online finetuning. Extensive experiments demonstrate that
PROTO achieves superior performance over SOTA baselines, offering an adaptable
and efficient offline-to-online RL framework.
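
The abstract does not give the exact objective, so the following is only a minimal sketch of what an iteratively evolving, trust-region-style regularizer could look like. It assumes a single state with discrete actions, fixed Q-values, a KL penalty toward the previous iterate, and a geometric decay schedule for the penalty weight. The names `regularized_policy_step`, `finetune`, `alpha0`, and `decay` are hypothetical and are not taken from PROTO's code.

```python
import math


def regularized_policy_step(prev_policy, q_values, alpha):
    """One KL-regularized improvement step for a single state.

    Maximizes  sum_a pi(a) * Q(a) - alpha * KL(pi || prev_policy),
    whose closed-form solution is pi(a) proportional to
    prev_policy(a) * exp(Q(a) / alpha).
    """
    max_q = max(q_values)  # subtract the max for numerical stability
    weights = [
        p * math.exp((q - max_q) / alpha)
        for p, q in zip(prev_policy, q_values)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def finetune(pretrained_policy, q_values, alpha0=10.0, decay=0.5, steps=8):
    """Iterate the step above while relaxing the constraint strength.

    Assumption: alpha shrinks geometrically; the real schedule may differ.
    Returns the list of policies, starting with the pretrained one.
    """
    policy = list(pretrained_policy)
    alpha = alpha0
    history = [policy]
    for _ in range(steps):
        # The regularizer is anchored to the previous iterate, not to the
        # fixed pretrained policy, so the trust region moves with the policy.
        policy = regularized_policy_step(policy, q_values, alpha)
        history.append(policy)
        alpha *= decay  # weaker constraint at every iteration
    return history


# Pretrained policy prefers action 0, but action 2 has the highest value.
history = finetune([0.7, 0.2, 0.1], q_values=[0.0, 1.0, 2.0])
```

With a large initial `alpha` the first update stays close to the pretrained policy, and as `alpha` decays the policy is free to concentrate on the highest-value action. In a real finetuning loop the Q-values would be learned online rather than fixed.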

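PROTO augments the standard RL objective with an iteratively evolving regularization term, providing an offline-to-online RL pathway that adapts to diverse methods and finetunes efficiently online.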

PROTO: Iterative Policy Regularized Offline-to-Online Reinforcement Learning

Programs, consisting of semantic and structural information, play an
important role in the communication between humans and agents. Towards learning
general program executors to unify perception, reasoning, and decision making,
we formulate program-guided tasks which require learning to execute a given
program on the observed task specification. Furthermore, we propose the
Program-guided Transformer (ProTo), which integrates both semantic and
structural guidance of a program by leveraging cross-attention and masked
self-attention to pass messages between the specification and routines in the
program. ProTo executes a program in a learned latent space and enjoys stronger
representation ability than previous neural-symbolic approaches. We demonstrate
that ProTo significantly outperforms the previous state-of-the-art methods on
GQA visual reasoning and 2D Minecraft policy learning datasets. Additionally,
ProTo demonstrates better generalization to unseen, complex, and human-written
programs.
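
To make the two attention mechanisms named above concrete, here is a generic scaled dot-product attention sketch in plain Python. It is not ProTo's implementation: the abstract does not say how masks are built from a program, so the `structure_mask` below (each routine sees itself, and the last routine also sees the two routines it depends on) is a hypothetical example, as are the toy vectors and the function names.

```python
import math


def softmax(scores):
    """Numerically stable softmax; masked entries (-inf) get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, mask=None):
    """Scaled dot-product attention over lists of vectors.

    mask[i][j] is True when query i may attend to key j; None allows all.
    Every query must be allowed to attend to at least one key.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            allowed = mask is None or mask[i][j]
            dot = sum(a * b for a, b in zip(q, k))
            scores.append(dot / scale if allowed else float("-inf"))
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


# Toy embeddings (hypothetical): three program routines, two specification tokens.
routines = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
spec = [[2.0, 0.0], [0.0, 2.0]]

# Hypothetical structural mask: routine 2 depends on routines 0 and 1.
structure_mask = [
    [True, False, False],
    [False, True, False],
    [True, True, True],
]

# Masked self-attention: messages flow only along the allowed program structure.
self_out = attention(routines, routines, routines, mask=structure_mask)

# Cross-attention: each routine reads from the specification.
cross_out = attention(routines, spec, spec)
```

The mask restricts which routines exchange information, while the unmasked cross-attention lets every routine query the whole specification.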

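ProTo learns through program-guided tasks, combining semantic and structural guidance and passing messages between the specification and the routines in the program via cross-attention and masked self-attention. On the GQA visual reasoning and 2D Minecraft policy learning datasets, ProTo significantly outperforms previous state-of-the-art methods and shows better generalization.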