Reinforcement learning in partially observed Markov decision processes
(POMDPs) faces two challenges. (i) It often takes the full history to predict
the future, which induces a sample complexity that scales exponentially with
the horizon. (ii) The observation and state spaces are often continuous, which
induces a sample complexity that scales exponentially with the extrinsic
dimension. Addressing such challenges requires learning a minimal but
sufficient representation of the observation and state histories by exploiting
the structure of the POMDP.
To this end, we propose a reinforcement learning algorithm named Embed to
Control (ETC), which learns the representation at two levels while optimizing
the policy. (i) For each step, ETC learns to represent the state with a
low-dimensional feature, which factorizes the transition kernel. (ii) Across
multiple steps, ETC learns to represent the full history with a low-dimensional
embedding, which assembles the per-step feature. We integrate (i) and (ii) in a
unified framework that allows a variety of estimators (including maximum
likelihood estimators and generative adversarial networks). For a class of
POMDPs with a low-rank structure in the transition kernel, ETC attains an
$O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon
and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the
optimality gap. To the best of our knowledge, ETC is the first sample-efficient
algorithm that bridges representation learning and policy optimization in
POMDPs with infinite observation and state spaces.
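
As a rough illustration of what "a low-dimensional feature, which factorizes
the transition kernel" can mean, the sketch below builds a toy finite
transition kernel of rank at most d as an inner product
P(s' | s, a) = <phi(s, a), psi(s')>. This is an assumption-laden toy, not the
ETC algorithm: the abstract concerns continuous observation and state spaces,
and the names `random_simplex`, `low_rank_kernel`, `transition_prob`, `phi`,
and `psi` are hypothetical. The mixture construction (phi on the simplex, each
psi[k] a distribution) is just one simple way to guarantee a valid kernel.

```python
import random


def random_simplex(n, rng):
    """Return n positive weights that sum to 1."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def low_rank_kernel(num_states, num_actions, rank, seed=0):
    """Sample hypothetical features (phi, psi) defining a rank-limited kernel."""
    rng = random.Random(seed)
    # phi[(s, a)]: rank-dimensional feature of the current state-action pair,
    # constrained to the probability simplex (toy assumption).
    phi = {
        (s, a): random_simplex(rank, rng)
        for s in range(num_states)
        for a in range(num_actions)
    }
    # psi[k]: a distribution over next states for latent factor k.
    psi = [random_simplex(num_states, rng) for _ in range(rank)]
    return phi, psi


def transition_prob(phi, psi, s, a, s_next):
    """P(s_next | s, a) = sum_k phi_k(s, a) * psi_k(s_next)."""
    return sum(phi[(s, a)][k] * psi[k][s_next] for k in range(len(psi)))


phi, psi = low_rank_kernel(num_states=6, num_actions=2, rank=3)
row = [transition_prob(phi, psi, 0, 1, s_next) for s_next in range(6)]
# Each row of the kernel should be a probability distribution.
print(abs(sum(row) - 1.0) < 1e-9)
```

In this toy, the kernel has 6 x 2 x 6 entries but is fully described by 3
latent factors, which is the sense in which the rank plays the role of an
intrinsic dimension.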

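This paper proposes a reinforcement learning algorithm named Embed to Control
(ETC), which addresses the sample complexity problem in partially observed
Markov decision processes (POMDPs) by learning a minimal but sufficient
representation of the observation and state histories. ETC bridges
representation learning and policy optimization, attains an efficient sample
complexity, and applies to POMDPs with a low-rank structure.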

Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency

Many medical decision-making tasks can be framed as partially observed Markov
decision processes (POMDPs). However, prevailing two-stage approaches that
first learn a POMDP and then solve it often fail because the model that best
fits the data may not be well suited for planning. We introduce a new
optimization objective that (a) produces both high-performing policies and
high-quality generative models, even when some observations are irrelevant for
planning, and (b) does so in the batch off-policy settings that are typical in
healthcare, where only retrospective data is available. We demonstrate our
approach on synthetic examples and a challenging medical decision-making
problem.

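This paper proposes a new optimization objective for batch off-policy
settings. Even when some observations are irrelevant for planning, the method
produces both high-performing policies and high-quality generative models. It
is applied to synthetic examples and a challenging medical decision-making
problem.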