We study an approach to offline reinforcement learning (RL) based on
optimally solving finitely-represented MDPs derived from a static dataset of
experience. This approach can be applied on top of any learned representation
and has the potential to easily support multiple solution objectives as well as
zero-shot adjustment to changing environments and goals. Our main contribution
is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate
its solutions for offline RL. A DAC-MDP is a non-parametric model that can
leverage deep representations and account for limited data by introducing costs
for exploiting under-represented parts of the model. In theory, we give
conditions under which the performance of DAC-MDP solutions can be lower-bounded.
We also investigate the empirical behavior in a number of environments,
including those with image-based observations. Overall, the experiments
demonstrate that the framework can work in practice and scale to large, complex
offline RL problems.

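We study an approach to offline reinforcement learning that optimally solves finitely represented MDPs derived from a static dataset. The approach can be applied on top of any learned representation and can support multiple solution objectives as well as zero-shot adjustment; the main contribution is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate its solutions for offline RL. Experiments show that the approach can work in practice and scale to large, complex offline RL problems.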
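
To make the idea of a finite, cost-penalized model concrete, the following is a minimal sketch and not the authors' implementation. It assumes several details the abstract does not state: the finite MDP has one node per stored next state, each node-action pair averages over its k nearest dataset transitions taken with that action, and the cost is a constant times the Euclidean distance in representation space. The names `build_dac_mdp`, `solve`, `k`, and `cost` are hypothetical, and each action is assumed to occur at least `k` times in the dataset.

```python
import numpy as np

def build_dac_mdp(states, actions, rewards, next_states, n_actions, k=2, cost=1.0):
    """Build a finite MDP from transitions (s, a, r, s').

    Node j is the next state of transition j (an assumption of this sketch).
    Returns neighbor indices and cost-penalized rewards, both of shape
    (n_nodes, n_actions, k).
    """
    n = len(next_states)
    neighbors = np.zeros((n, n_actions, k), dtype=int)
    penalized = np.zeros((n, n_actions, k))
    for a in range(n_actions):
        idx = np.flatnonzero(actions == a)  # transitions that took action a
        # Distance from every node to the source state of each such transition.
        dist = np.linalg.norm(next_states[:, None, :] - states[None, idx, :], axis=-1)
        nearest = np.argsort(dist, axis=1)[:, :k]
        neighbors[:, a, :] = idx[nearest]
        near_dist = np.take_along_axis(dist, nearest, axis=1)
        # Far-away (under-represented) neighbors incur a larger cost.
        penalized[:, a, :] = rewards[idx][nearest] - cost * near_dist
    return neighbors, penalized

def solve(neighbors, penalized, gamma=0.9, iters=200):
    """Value iteration on the finite model; returns Q of shape (n_nodes, n_actions)."""
    n, n_actions, _ = neighbors.shape
    q = np.zeros((n, n_actions))
    v = np.zeros(n)
    for _ in range(iters):
        # Each neighbor is weighted equally (a uniform averager).
        q = (penalized + gamma * v[neighbors]).mean(axis=2)
        v = q.max(axis=1)
    return q
```

In this sketch, setting `cost=0.0` gives a plain nearest-neighbor averager, while larger values make actions that are poorly supported by nearby data look less attractive to the planner. The states passed in can be raw observations or the output of any learned encoder.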