Offline reinforcement learning (RL) can in principle synthesize higher-return
behavior from a dataset consisting only of suboptimal trials. One way that this
can happen is by "stitching" together the best parts of otherwise suboptimal
trajectories that overlap on similar states, to create new behaviors where each
individual state is in-distribution, but the overall returns are higher.
However, in many interesting and complex applications, such as autonomous
navigation and dialogue systems, the state is partially observed. Even worse,
the state representation is unknown or hard to define. In such cases,
policies and value functions are often conditioned on observation histories
instead of states. It is then unclear whether the same kind of "stitching" is
feasible at the level of observation histories, since two different
trajectories always have different histories, and thus the "similar states"
that might enable effective stitching cannot be leveraged.
Theoretically, we show that standard offline RL algorithms conditioned on
observation histories suffer from poor sample complexity, in accordance with
the above intuition. We then identify sufficient conditions under which offline
RL can still be efficient -- intuitively, it needs to learn a compact
representation of history comprising only features relevant for action
selection. We introduce a bisimulation loss that captures the extent to which
this happens, and propose that offline RL can explicitly optimize this loss to
improve worst-case sample complexity. Empirically, we show that across a variety of
tasks either our proposed loss improves performance, or the value of this loss
is already minimized as a consequence of standard offline RL, indicating that
it correlates well with good performance.
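
To make the idea of a bisimulation loss more concrete, here is a minimal sketch of a generic bisimulation-style objective on history encodings. It is not the loss defined in the paper, whose exact form the abstract does not give. The function name, the L1 distance, the discount `gamma`, and the use of a single encoded successor per history (rather than a distance between successor distributions) are all assumptions made for illustration.

```python
import numpy as np

def bisimulation_style_loss(z, z_next, rewards, gamma=0.99):
    """Hypothetical bisimulation-style loss over a batch of encoded histories.

    z       : (N, d) encodings of observation histories
    z_next  : (N, d) encodings of the successor histories
    rewards : (N,)   rewards observed at those histories

    Two histories should be close in representation space only if they
    yield similar rewards and lead to similar successor representations.
    """
    z = np.asarray(z, dtype=float)
    z_next = np.asarray(z_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    # Pairwise L1 distances between current and successor encodings.
    dist = np.abs(z[:, None, :] - z[None, :, :]).sum(axis=-1)
    next_dist = np.abs(z_next[:, None, :] - z_next[None, :, :]).sum(axis=-1)

    # Pairwise reward differences.
    reward_gap = np.abs(rewards[:, None] - rewards[None, :])

    # Regress encoding distance onto reward gap plus discounted successor
    # distance (a point-estimate stand-in for a distributional distance).
    target = reward_gap + gamma * next_dist
    return float(np.mean((dist - target) ** 2))
```

Under these assumptions, a low loss means the encoder has merged histories that behave alike and kept apart those that do not, which is the kind of compact representation the abstract argues is needed for stitching. In a real training loop the successor term would normally be treated as a fixed target, and the loss would be added to the offline RL objective with a weighting coefficient.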

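Standard offline RL algorithms conditioned on observation histories suffer from poor sample complexity; introducing a bisimulation loss that offline RL can optimize explicitly improves performance.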

Offline RL with Observation Histories: Analyzing and Improving Sample Complexity

Imitation learning trains control policies by mimicking pre-recorded expert
demonstrations. In partially observable settings, imitation policies must rely
on observation histories, but many seemingly paradoxical results show better
performance for policies that only access the most recent observation. Recent
solutions ranging from causal graph learning to deep information bottlenecks
have shown promising results, but failed to scale to realistic settings such as
visual imitation. We propose a solution that outperforms these prior approaches
by upweighting demonstration keyframes corresponding to expert action
changepoints. This simple approach easily scales to complex visual imitation
settings. Our experimental results demonstrate consistent performance
improvements over all baselines on image-based Gym MuJoCo continuous control
tasks. Finally, on the CARLA photorealistic vision-based urban driving
simulator, we resolve a long-standing issue in behavioral cloning for driving
by demonstrating effective imitation from observation histories. Supplementary
materials and code at: https://tinyurl.com/imitation-keyframes.
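
The keyframe upweighting described above can be sketched roughly as follows. The abstract does not say how action changepoints are detected or how large the weights are, so the thresholded action difference, the `threshold` and `keyframe_weight` values, and both function names are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def keyframe_weights(actions, threshold=0.1, keyframe_weight=5.0):
    """Assign a larger weight to timesteps where the expert action changes.

    actions : (T,) or (T, action_dim) expert actions from one demonstration
    Returns a (T,) weight vector normalized to have mean 1.
    """
    actions = np.asarray(actions, dtype=float)
    if actions.ndim == 1:
        actions = actions[:, None]

    # Magnitude of change from the previous action; the first step has no
    # predecessor, so its change is defined as 0.
    deltas = np.linalg.norm(np.diff(actions, axis=0), axis=1)
    change = np.concatenate([[0.0], deltas])

    # Assumed changepoint rule: a step is a keyframe if the change
    # exceeds the threshold.
    weights = np.where(change > threshold, keyframe_weight, 1.0)
    return weights / weights.mean()

def weighted_bc_loss(pred_actions, expert_actions, weights):
    """Per-timestep squared error, weighted by the keyframe weights."""
    weights = np.asarray(weights, dtype=float)
    pred = np.asarray(pred_actions, dtype=float).reshape(len(weights), -1)
    expert = np.asarray(expert_actions, dtype=float).reshape(len(weights), -1)
    per_step = ((pred - expert) ** 2).sum(axis=1)
    return float(np.mean(weights * per_step))
```

For a demonstration with actions `[0, 0, 0, 1, 1, 1]`, only the fourth timestep is treated as a keyframe, so an error there costs more than the same error elsewhere. The intent is to discourage a history-conditioned policy from simply repeating the previous action, which is cheap on most frames but wrong exactly at changepoints.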

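This work proposes improving imitation learning by upweighting demonstration keyframes, aiming for better performance in realistic settings such as visual imitation, and validates the approach on image-based and vision-based control tasks.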