Decision Transformers have recently emerged as a new and compelling paradigm for offline Reinforcement Learning (RL), completing a trajectory in an autoregressive way. While improvements have been made to overcome initial shortcomings, online finetuning of decision transformers has been surprisingly under-explored. The widely adopted state-of-the-art Online Decision Transformer (ODT) still struggles when pretrained with low-reward offline data. In this paper, we theoretically analyze the online finetuning of the decision transformer, showing that a commonly used Return-To-Go (RTG) that is far from the expected return hampers the online finetuning process. This problem, however, is well addressed by the value function and advantage of standard RL algorithms. As suggested by our analysis, we find in our experiments that simply adding TD3 gradients to the finetuning process of ODT effectively improves its online finetuning performance, especially if ODT is pretrained with low-reward offline data. These findings provide new directions to further improve decision transformers.
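
To make the RTG mismatch concrete, the following minimal sketch computes the return-to-go of a trajectory as the sum of its future rewards and compares it with a much larger conditioning target. The function name `returns_to_go`, the reward values, and the target of 100 are illustrative assumptions, not values or code from the paper.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Return-to-go at step t: the sum of rewards from step t to the end of the episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate rewards backwards so each entry holds the remaining return.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


# Hypothetical low-reward offline trajectory (made-up values for illustration).
offline_rewards = [0.0, 1.0, 0.0, 1.0]
offline_rtg = returns_to_go(offline_rewards)

# Hypothetical RTG used to condition the policy during online rollouts.
target_rtg = 100.0

# Gap between the conditioning RTG and the return actually present in the data.
rtg_gap = target_rtg - offline_rtg[0]
print(offline_rtg, rtg_gap)
```

When the gap is large, the policy is conditioned on a return it has never observed during pretraining, which is the regime the abstract identifies as problematic for online finetuning.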

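This study theoretically analyzes the under-explored online finetuning of decision transformers and points out the negative effect that the conventional return-to-go computation has on the finetuning process. Experiments show that adding TD3 gradients to the finetuning process of the Online Decision Transformer significantly improves its online finetuning performance, especially when it is pretrained with low-reward offline data. This provides new directions for further improving decision transformers.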

Reinforcement Learning Gradients Boost Online Finetuning of Decision Transformers