We propose Heuristic Blending (HUBL), a simple performance-improving technique for a broad class of offline RL algorithms based on value bootstrapping. HUBL modifies Bellman operators used in these algorithms, partially replacing the bootstrapped values with Monte-Carlo returns as heuristics. For trajectories with higher returns, HUBL relies more on heuristics and less on bootstrapping; otherwise, it leans more heavily on bootstrapping. We show that this idea can be easily implemented by relabeling the offline datasets with adjusted rewards and discount factors, making HUBL readily usable by many existing offline RL implementations. We theoretically prove that HUBL reduces offline RL's complexity and thus improves its finite-sample performance. Furthermore, we empirically demonstrate that HUBL consistently improves the policy quality of four state-of-the-art bootstrapping-based offline RL algorithms (ATAC, CQL, TD3+BC, and IQL), by 9% on average over 27 datasets of the D4RL and Meta-World benchmarks.
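The relabeling described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the blended target takes the convex-combination form r + γ[(1 − λ)V(s') + λh(s')], which rearranges into an adjusted reward r + γλh(s') and an adjusted discount γ(1 − λ). The function names (`discounted_returns`, `hubl_relabel`), the parameter `alpha`, and the min-max rule that maps trajectory returns to blending weights are hypothetical choices made for this example.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Monte-Carlo return-to-go G_t = r_t + gamma * G_{t+1} for one trajectory."""
    returns = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def hubl_relabel(trajectories, gamma=0.99, alpha=0.5):
    """Relabel each trajectory with adjusted rewards and discount factors.

    trajectories: list of 1-D reward sequences, one per trajectory.
    Returns a list of (adjusted_rewards, adjusted_discounts) pairs.
    """
    rewards = [np.asarray(r, dtype=float) for r in trajectories]
    heuristics = [discounted_returns(r, gamma) for r in rewards]

    # Hypothetical blending rule (an assumption of this sketch): min-max
    # normalise the trajectory returns and scale by alpha, so that
    # higher-return trajectories rely more on the heuristic.
    totals = np.array([h[0] for h in heuristics])
    span = totals.max() - totals.min()
    weights = alpha * (totals - totals.min()) / span if span > 0 else np.full(len(totals), alpha)

    relabeled = []
    for r, h, lam in zip(rewards, heuristics, weights):
        # Heuristic at the next state; taken as 0 after the final step.
        h_next = np.append(h[1:], 0.0)
        adjusted_rewards = r + gamma * lam * h_next
        adjusted_discounts = np.full(len(r), gamma * (1.0 - lam))
        relabeled.append((adjusted_rewards, adjusted_discounts))
    return relabeled

# Usage: two toy trajectories, one with high return and one with zero return.
relabeled = hubl_relabel([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]], gamma=0.5, alpha=0.5)
```

With this rule, the higher-return trajectory gets a smaller adjusted discount (less bootstrapping), while the lowest-return trajectory keeps its original rewards and discount. The relabeled data can then be passed to an existing bootstrapping-based learner that accepts per-transition discounts.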

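Summary: This work proposes Heuristic Blending (HUBL), a simple performance-improving technique for a broad class of offline reinforcement learning algorithms based on value bootstrapping. HUBL modifies the Bellman operators used in these algorithms by replacing part of the bootstrapped values with Monte-Carlo returns that serve as heuristics. The idea is implemented by relabeling the offline datasets with adjusted rewards and discount factors. The authors prove that HUBL reduces the complexity of offline RL and thereby improves its finite-sample performance, and they show empirically that HUBL improves the policy quality of four existing algorithms (ATAC, CQL, TD3+BC, and IQL) by 9% on average over 27 datasets from the D4RL and Meta-World benchmarks.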

Improving Offline Reinforcement Learning by Blending Heuristics