Discovering an informative, or agent-centric, state representation that
encodes only the relevant information while discarding the irrelevant is a key
challenge in scaling reinforcement learning algorithms and efficiently
applying them to downstream tasks. Prior work studied this problem in
high-dimensional Markovian environments, where the current observation may be a
complex object but is sufficient to decode the informative state. In this work,
we consider the problem of discovering the agent-centric state in the more
challenging high-dimensional non-Markovian setting, where the state can be
decoded from a sequence of past observations. We establish that generalized
inverse models can be adapted to learn agent-centric state representations
in this setting. Our results include asymptotic theory in the deterministic
dynamics setting as well as counter-examples for alternative intuitive
algorithms. We complement these findings with a thorough empirical study of the
agent-centric state discovery abilities of the different alternatives we put
forward. Particularly notable is our analysis of past actions, where we show
that these can be a double-edged sword: making the algorithms more successful
when used correctly and causing dramatic failure when used incorrectly.
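
As a rough illustration of the setting described above, the sketch below builds
training tuples for a multi-step inverse objective when the state must be
decoded from a window of past observations. It is a minimal sketch under stated
assumptions, not the paper's algorithm: the function name
`make_inverse_samples`, the parameters `memory` and `max_k`, and the choice to
put only observations (no past actions) in each window are all hypothetical.
Each tuple pairs the history window ending at time t with the window ending at
time t + k, and the prediction target is the first action a_t.

```python
from typing import Hashable, List, Sequence, Tuple

# (window ending at t, window ending at t + k, k, action taken at t)
Sample = Tuple[tuple, tuple, int, Hashable]


def make_inverse_samples(
    observations: Sequence[Hashable],
    actions: Sequence[Hashable],
    memory: int,
    max_k: int,
) -> List[Sample]:
    """Build (history, future history, k, first action) tuples.

    Assumed convention: actions[t] is taken after observations[t] and leads
    to observations[t + 1], so there is one more observation than actions.
    """
    if len(observations) != len(actions) + 1:
        raise ValueError("expected len(observations) == len(actions) + 1")
    if memory < 1 or max_k < 1:
        raise ValueError("memory and max_k must be positive")

    samples: List[Sample] = []
    last_index = len(observations) - 1
    # Start at the first t for which a full window of `memory` observations exists.
    for t in range(memory - 1, len(actions)):
        current = tuple(observations[t - memory + 1 : t + 1])
        for k in range(1, max_k + 1):
            if t + k > last_index:
                break  # the future window would run past the trajectory
            future = tuple(observations[t + k - memory + 1 : t + k + 1])
            # An inverse model would be trained to predict actions[t]
            # from (current, future, k).
            samples.append((current, future, k, actions[t]))
    return samples


if __name__ == "__main__":
    obs = ["o0", "o1", "o2", "o3", "o4"]
    acts = ["a0", "a1", "a2", "a3"]
    for sample in make_inverse_samples(obs, acts, memory=2, max_k=2):
        print(sample)
```

Whether past actions should also be placed in these windows is exactly the
question the abstract flags as a double-edged sword, so the sketch leaves them
out rather than guess at the correct usage.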

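The key challenge in learning an agent-centric state representation is to encode only the relevant information and discard the irrelevant, so that reinforcement learning algorithms can be scaled and applied efficiently to downstream tasks. This study considers the problem of discovering the agent-centric state in the more challenging high-dimensional non-Markovian setting, where the state is decoded from a sequence of past observations, and addresses it by adapting generalized inverse models. The results include asymptotic theory in the deterministic dynamics setting and counter-examples for alternative intuitive algorithms. We also carry out a thorough empirical study of the agent-centric state discovery abilities of the proposed alternatives. The analysis of past actions is particularly notable: used correctly, past actions make the algorithms more successful, while used incorrectly they cause severe failure.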

Generalizing Multi-Step Inverse Models for Representation Learning to Finite-Memory POMDPs

Since reward functions are hard to specify, recent work has focused on
learning policies from human feedback. However, such approaches are impeded by
the expense of acquiring such feedback. Recent work proposed that agents have
access to a source of information that is effectively free: in any environment
that humans have acted in, the state will already be optimized for human
preferences, and thus an agent can extract information about what humans want
from the state. Such learning is possible in principle, but requires simulating
all possible past trajectories that could have led to the observed state. This
is feasible in gridworlds, but how do we scale it to complex tasks? In this
work, we show that by combining a learned feature encoder with learned inverse
models, we can enable agents to simulate human actions backwards in time to
infer what they must have done. The resulting algorithm is able to reproduce a
specific skill in MuJoCo environments given a single state sampled from the
optimal policy for that skill.
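
The backward simulation idea can be sketched on a toy problem. This is a
minimal sketch under stated assumptions, not the paper's method: the functions
`simulate_backwards`, `toy_inverse_policy`, and `toy_inverse_dynamics` are
hypothetical, and the two toy functions are hand-written stand-ins for what
would be learned models operating on encoded features. The toy world is a line
of integer positions in which the human is assumed to always step right.

```python
from typing import Callable, List, Optional, Tuple

State = int
Action = int
Step = Tuple[State, Optional[Action]]


def simulate_backwards(
    final_state: State,
    inverse_policy: Callable[[State], Action],
    inverse_dynamics: Callable[[State, Action], State],
    horizon: int,
) -> List[Step]:
    """Roll back `horizon` steps from an observed state.

    inverse_policy(s_next) guesses the action that led into s_next;
    inverse_dynamics(s_next, a) guesses the state that preceded it.
    Returns (state, action taken in that state) pairs in forward time order;
    the final state is paired with None.
    """
    trajectory: List[Step] = [(final_state, None)]
    state = final_state
    for _ in range(horizon):
        action = inverse_policy(state)
        state = inverse_dynamics(state, action)
        trajectory.append((state, action))
    trajectory.reverse()
    return trajectory


def toy_inverse_policy(next_state: State) -> Action:
    # Stand-in for a learned inverse policy: assume the human always moved +1.
    return 1


def toy_inverse_dynamics(next_state: State, action: Action) -> State:
    # Stand-in for a learned inverse dynamics model: undo the move.
    return next_state - action


if __name__ == "__main__":
    print(simulate_backwards(3, toy_inverse_policy, toy_inverse_dynamics, 3))
```

In this toy case the rollback from position 3 recovers the three rightward
steps that must have preceded it. With learned, stochastic models one would
instead sample many such backward trajectories.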

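Summary: This work aims to learn policies that reflect human preferences without relying on costly human feedback. It combines a learned feature encoder with learned inverse models so that the agent can simulate human actions backwards in time and infer what the human was trying to achieve.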