Temporal-Difference (TD) learning is a general and very useful tool for
estimating the value function of a given policy, which in turn is required to
find good policies. Generally speaking, TD learning updates states whenever
they are visited. When the agent lands in a state, its value can be used to
compute the TD-error, which is then propagated to other states. However, when
computing updates, it may be worthwhile to take into account information other
than whether a state is visited. For example, some states might be more
important than others (such as states which are frequently seen in a successful
trajectory). Or, some states might have unreliable value estimates (for
example, due to partial observability or lack of data), making their values
less desirable as targets. We propose an approach to re-weighting states used
in TD updates, both when they are the input and when they provide the target
for the update. We prove that our approach converges with linear function
approximation and illustrate its desirable empirical behaviour compared to
other TD-style methods.
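
To make the re-weighting idea concrete, here is a minimal sketch of one plausible form of it with linear function approximation and an eligibility trace. It is an illustration under stated assumptions, not necessarily the exact algorithm of the paper. The per-state weight `beta`, the function name `weighted_td_episode`, and the trace rule are all hypothetical choices made for this example. A state with weight near 1 is updated strongly and cuts the trace, so earlier states bootstrap from it. A state with weight near 0 is barely updated, and earlier states look past it to later rewards.

```python
# Sketch only: preference-weighted linear TD with a trace.
# `beta` and the trace rule below are illustrative assumptions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_td_episode(theta, transitions, features, beta, alpha=0.1, gamma=0.9):
    """Apply one episode of weighted TD updates and return the new weights.

    transitions: list of (state, reward, next_state, done) tuples.
    features:    dict mapping state -> feature vector (list of floats).
    beta:        dict mapping state -> weight in [0, 1] (assumed given).
    """
    trace = [0.0] * len(theta)
    for s, r, s_next, done in transitions:
        phi = features[s]
        b = beta[s]
        # The current state adds b * phi(s) to the trace (its role as input)
        # and keeps a (1 - b) share of the old trace (its role as target).
        trace = [gamma * (1.0 - b) * e + b * x for e, x in zip(trace, phi)]
        v_next = 0.0 if done else dot(theta, features[s_next])
        delta = r + gamma * v_next - dot(theta, phi)  # one-step TD error
        theta = [w + alpha * delta * e for w, e in zip(theta, trace)]
    return theta

# Two-state chain with one-hot features: state 0 -> state 1 -> terminal (reward 1).
features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
episode = [(0, 0.0, 1, False), (1, 1.0, None, True)]

# All weights 1: the rule reduces to ordinary TD(0).
plain = weighted_td_episode([0.0, 0.0], episode, features,
                            {0: 1.0, 1: 1.0}, alpha=0.5, gamma=1.0)
# State 1 marked unreliable: it is not updated, and state 0 learns from the reward directly.
skipped = weighted_td_episode([0.0, 0.0], episode, features,
                              {0: 1.0, 1: 0.0}, alpha=0.5, gamma=1.0)
```

With every weight set to 1 the trace equals the current feature vector, so the sketch behaves like TD(0). Lowering a state's weight shifts credit for later rewards onto the states that preceded it.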

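Proposes a method for re-weighting states in TD learning so that the update equations account for both a state's importance and the reliability of its value estimate. The method is proven to converge under linear function approximation and is compared experimentally with other TD methods.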

Preferential Temporal Difference Learning

We study an approach to offline reinforcement learning (RL) based on
optimally solving finitely-represented MDPs derived from a static dataset of
experience. This approach can be applied on top of any learned representation
and has the potential to easily support multiple solution objectives as well as
zero-shot adjustment to changing environments and goals. Our main contribution
is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate
its solutions for offline RL. DAC-MDPs are a non-parametric model that can
leverage deep representations and account for limited data by introducing costs
for exploiting under-represented parts of the model. In theory, we show
conditions that allow for lower-bounding the performance of DAC-MDP solutions.
We also investigate the empirical behavior in a number of environments,
including those with image-based observations. Overall, the experiments
demonstrate that the framework can work in practice and scale to large complex
offline RL problems.
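
The idea of deriving a finite MDP from a static dataset and penalizing poorly covered regions can be sketched as follows. This is a simplified illustration, not the authors' construction in detail. The function name `solve_dataset_mdp`, the Euclidean distance on raw state vectors, the neighbor count `k`, and the cost coefficient `cost` are assumptions made for the example. A learned representation would replace the raw vectors in practice. The sketch also assumes every action appears at least once in the dataset.

```python
# Sketch only: a finite MDP derived from a static dataset by
# k-nearest-neighbor averaging, with a distance-based cost on rewards.

def solve_dataset_mdp(dataset, actions, k=1, cost=1.0, gamma=0.9, sweeps=200):
    """dataset: list of (state_vec, action, reward, next_state_vec, done).

    Returns one value per dataset transition, namely the value of that
    transition's next state in the derived MDP.
    """
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    # States of the derived MDP: the next-states observed in the data.
    core = [tr[3] for tr in dataset]
    terminal = [tr[4] for tr in dataset]

    def neighbors(state, action):
        # k nearest dataset transitions that used `action`, as (distance, index).
        cands = [(dist(state, tr[0]), j)
                 for j, tr in enumerate(dataset) if tr[1] == action]
        return sorted(cands)[:k]

    nbrs = [{a: neighbors(s, a) for a in actions} for s in core]

    def q_value(nb, values):
        # Average over neighbors of: observed reward, minus a cost that grows
        # with distance, plus the discounted value of the neighbor's next state.
        return sum(dataset[j][2] - cost * d + gamma * values[j]
                   for d, j in nb) / len(nb)

    # Synchronous value iteration over the finite set of core states.
    values = [0.0] * len(core)
    for _ in range(sweeps):
        values = [0.0 if terminal[i] else
                  max(q_value(nbrs[i][a], values) for a in actions)
                  for i in range(len(core))]
    return values

# Toy one-dimensional data: moving right from 1.0 ends the episode with reward 1.
data = [
    ([0.0], "right", 0.0, [1.0], False),
    ([1.0], "right", 1.0, [2.0], True),
    ([1.0], "left", 0.0, [0.0], False),
]
values = solve_dataset_mdp(data, ["left", "right"], k=1, cost=1.0, gamma=0.5)
```

Because the cost is subtracted in proportion to the distance from the nearest supporting data, actions that are only weakly represented near a state look less attractive to the planner. In the toy data, "left" from state 0.0 is penalized because the only "left" transition was recorded at state 1.0.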

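Studies an approach to offline reinforcement learning based on optimally solving finitely-represented MDPs derived from a static dataset. The approach can be applied on top of any learned representation and can support multiple solution objectives as well as zero-shot adjustment to changing environments and goals. The main contribution is the Deep Averagers with Costs MDP (DAC-MDP) and an investigation of its solutions for offline RL. Experiments show that the approach can work in practice and scale to large, complex offline RL problems.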