Model-based reinforcement learning (RL) methods are appealing in the offline
setting because they allow an agent to reason about the consequences of actions
without interacting with the environment. Prior methods learn a 1-step dynamics
model, which predicts the next state given the current state and action. These
models do not immediately tell the agent which actions to take, but must be
integrated into a larger RL framework. Can we model the environment dynamics in
a different way, such that the learned model does directly indicate the value
of each action? In this paper, we propose Contrastive Value Learning (CVL),
which learns an implicit, multi-step model of the environment dynamics. This
model can be learned without access to reward functions, but nonetheless can be
used to directly estimate the value of each action, without requiring any TD
learning. Because this model represents the multi-step transitions implicitly,
it avoids having to predict high-dimensional observations and thus scales to
high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior
offline RL methods on complex continuous control benchmarks.
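To make the claim that a multi-step model can yield action values without TD learning concrete, here is a minimal tabular sketch. It is not the paper's implementation; everything in it is an assumption made for illustration. It uses a hypothetical random MDP, a reward `r(s)` received on arrival in a state, and a critic `f_star` set in closed form to the log density ratio between the discounted future-state distribution and a marginal state distribution. A binary noise-contrastive classifier recovers that ratio at its optimum, and here it stands in for a trained network. Under these assumptions, Q(s, a) equals the reward averaged over marginal states, reweighted by `exp(f)` and scaled by 1 / (1 - gamma). The sketch checks this against ordinary Bellman backups.

```python
import numpy as np


def discounted_future_state_dist(P_sa, pi, gamma):
    """M[s, a, s+] = (1 - gamma) * sum_{t>=0} gamma^t * Pr(s_{t+1} = s+ | s, a, pi)."""
    n_states = P_sa.shape[0]
    # State-to-state transition matrix induced by the policy pi.
    P_pi = np.einsum("sa,sax->sx", pi, P_sa)
    resolvent = np.linalg.inv(np.eye(n_states) - gamma * P_pi)
    return (1.0 - gamma) * (P_sa @ resolvent)


def q_from_critic(f, p_marginal, r, gamma):
    """Q(s, a) = 1 / (1 - gamma) * E_{s+ ~ p_marginal}[exp(f(s, a, s+)) * r(s+)]."""
    return (np.exp(f) * p_marginal * r).sum(axis=-1) / (1.0 - gamma)


def q_by_bellman_backups(P_sa, pi, r, gamma, n_iters=2000):
    """Reference policy evaluation by repeated Bellman backups (reward on arrival)."""
    q = np.zeros(P_sa.shape[:2])
    for _ in range(n_iters):
        v = (pi * q).sum(axis=1)        # V(s) = sum_a pi(a | s) Q(s, a)
        q = P_sa @ (r + gamma * v)      # Q(s, a) = E_{s'}[r(s') + gamma * V(s')]
    return q


# Hypothetical random MDP; sizes and seed are arbitrary.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P_sa = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # (S, A, S)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)                # (S, A)
r = rng.uniform(size=n_states)                                       # (S,)

M = discounted_future_state_dist(P_sa, pi, gamma)
p_marginal = M.mean(axis=(0, 1))             # stand-in for the dataset's state marginal
f_star = np.log(M) - np.log(p_marginal)      # idealized critic: log density ratio

q_contrastive = q_from_critic(f_star, p_marginal, r, gamma)
q_reference = q_by_bellman_backups(P_sa, pi, r, gamma)
print(np.max(np.abs(q_contrastive - q_reference)))
```

Note that the reward enters only in `q_from_critic`, so the critic itself can be fit without reward labels. In practice the critic would be a learned function and the expectation a sample average over reward-labeled states, so the estimate would only be approximate.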

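This paper introduces Contrastive Value Learning, a new model-based reinforcement learning method for the offline setting. Without requiring access to a reward function, it learns an implicit, multi-step model of the environment dynamics that can be used to directly estimate the value of each action, and it outperforms prior offline RL methods on complex continuous control benchmarks.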

Contrastive Value Learning: Implicit Models for Simple Offline RL

In this paper, we present a coarse-to-fine question answering (CFQA) system
based on reinforcement learning that can efficiently process documents of
different lengths by choosing appropriate actions. The system is designed using
an actor-critic based deep reinforcement learning model to achieve multi-step
question answering. Compared to previous QA models targeting datasets that
mainly contain either short or long documents, our multi-step coarse-to-fine
model combines the merits of multiple system modules and can handle both short
and long documents. The system hence obtains much better accuracy and faster
training speed than the current state-of-the-art models. We test our
model on four QA datasets, WIKIREADING, WIKIREADING LONG, CNN and SQuAD, and
demonstrate 1.3$\%$-1.7$\%$ accuracy improvements with 1.5x-3.4x training
speed-ups in comparison to the baselines using state-of-the-art models.

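This paper proposes a coarse-to-fine question answering (CFQA) system based on reinforcement learning. It uses a multi-step deep reinforcement learning model to process documents and can handle both short and long ones. Compared with previous QA models, it improves accuracy by 1.3%-1.7% and training speed by 1.5x-3.4x on four QA datasets: WIKIREADING, WIKIREADING LONG, CNN, and SQuAD.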