Conventional imitation learning assumes access to the actions of
demonstrators, but these motor signals are often non-observable in naturalistic
settings. Additionally, sequential decision-making behaviors in these settings
can deviate from the assumptions of a standard Markov Decision Process (MDP).
To address these challenges, we explore deep generative modeling of state-only
sequences with a non-Markov Decision Process (nMDP), where the policy is an
energy-based prior in the latent space of the state transition generator. We
develop maximum likelihood estimation to achieve model-based imitation, which
involves short-run MCMC sampling from the prior and importance sampling for the
posterior. The learned model enables decision-making as inference:
model-free policy execution is equivalent to prior sampling, and model-based
planning is posterior sampling initialized from the policy. We demonstrate the
efficacy of the proposed method in a prototypical path planning task with
non-Markovian constraints and show that the learned model exhibits strong
performance in challenging domains from the MuJoCo suite.
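
The sampling steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional double-well energy, the Gaussian likelihood, the step size, and all function names are assumptions chosen only to show short-run Langevin MCMC from an energy-based prior, followed by self-normalized importance weighting of those prior samples toward a posterior.

```python
import math
import random

def energy_grad(z):
    # Gradient of a toy double-well energy E(z) = (z**2 - 1)**2 (an assumption).
    return 4.0 * z * (z * z - 1.0)

def short_run_langevin(rng, n_steps=30, step_size=0.1):
    # Start from Gaussian noise and run a small, fixed number of Langevin steps.
    z = rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        z = z - 0.5 * step_size ** 2 * energy_grad(z) + step_size * noise
    return z

def importance_weights(samples, log_likelihood):
    # Self-normalized importance weights with the prior samples as proposal.
    logs = [log_likelihood(z) for z in samples]
    top = max(logs)
    unnormalized = [math.exp(value - top) for value in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

rng = random.Random(0)
prior_samples = [short_run_langevin(rng) for _ in range(200)]

# Hypothetical Gaussian likelihood of an observed state 1.0 given latent z.
weights = importance_weights(
    prior_samples, lambda z: -0.5 * (1.0 - z) ** 2 / 0.25
)
posterior_mean = sum(w * z for w, z in zip(weights, prior_samples))
```

In this toy setting, drawing from `prior_samples` plays the role of model-free policy execution, while reweighting by a likelihood plays the role of planning toward an observed outcome.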

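This paper studies deep generative modeling under a non-Markov decision process, enabling imitation learning and decision-making when actions cannot be observed.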

Learning non-Markovian Decision-Making from State-only Sequences

Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes
a parameterized policy model for an expected return using gradient ascent.
Given a well-parameterized policy model, such as a neural network model, with
appropriate initial parameters, PG algorithms work well even when the
environment does not have the Markov property. Otherwise, they can be trapped
on a plateau or suffer from peakiness effects. As another successful RL
approach, algorithms based on Monte-Carlo Tree Search (MCTS), which include
AlphaZero, have obtained groundbreaking results, especially in the board-game
domain. They are also applicable to non-Markov decision processes. However,
since standard MCTS does not have the ability to learn a state
representation, the size of the tree-search space can be too large to
search. In this work, we examine a mixture policy of PG and MCTS in which each
method compensates for the other's weaknesses while retaining its own
strengths. We derive conditions for asymptotic convergence using results from
two-timescale stochastic approximation and propose an algorithm that satisfies
these conditions. The effectiveness of the proposed methods is verified
through numerical experiments on non-Markov decision processes.
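
One simple way to picture a mixture of the two policies is a convex combination of their action distributions. The sketch below assumes that form; the abstract does not specify how the mixture is built, so `mixture_policy`, `mix_weight`, and the use of normalized visit counts as the MCTS policy are hypothetical choices for illustration only.

```python
import math

def softmax(logits):
    # Action probabilities of a parameterized (policy-gradient) policy.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def visit_count_policy(counts):
    # Action probabilities proportional to tree-search visit counts.
    total = sum(counts)
    return [c / total for c in counts]

def mixture_policy(pg_probs, mcts_probs, mix_weight):
    # Convex combination of the two distributions (assumed mixture form).
    return [
        (1.0 - mix_weight) * p + mix_weight * q
        for p, q in zip(pg_probs, mcts_probs)
    ]

pg = softmax([0.0, 0.0, 0.0])          # an untrained, uniform PG policy
mcts = visit_count_policy([6, 3, 1])   # search prefers the first action
mixed = mixture_policy(pg, mcts, 0.5)
```

With an uninformative PG policy, the mixture still inherits the search's preference ordering, which is the kind of complementarity the abstract describes.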

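This paper introduces a mixture policy that combines Policy Gradient and Monte-Carlo Tree Search, aiming to overcome the difficulties each method faces on non-Markov decision processes and to improve the efficiency of the algorithm.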

Policy Gradient Algorithms with Monte-Carlo Tree Search for Non-Markov Decision Processes

Recently, regular decision processes have been proposed as a well-behaved form
of non-Markov decision process. Regular decision processes are characterised by
a transition function and a reward function that depend on the whole history,
though regularly (as in regular languages). In practice, both the transition and
the reward functions can be seen as finite transducers. We study reinforcement
learning in regular decision processes. Our main contribution is to show that a
near-optimal policy can be PAC-learned in polynomial time in a set of
parameters that describe the underlying decision process. We argue that the
identified set of parameters is minimal and that it reasonably captures the
difficulty of a regular decision process.
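
To make "depends on the whole history, though regularly" concrete, here is a small sketch of a reward function written as a finite transducer (a Mealy machine). The two-symbol alphabet, the two states, and the rule "reward 1 for `b` once an `a` has been seen" are invented for illustration and do not come from the paper.

```python
# Transducer states: 0 = no "a" seen yet, 1 = "a" seen at some earlier step.
TRANSITIONS = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1}
REWARDS = {(0, "a"): 0.0, (0, "b"): 0.0, (1, "a"): 0.0, (1, "b"): 1.0}

def rewards_for_history(history):
    # Emit one reward per symbol while tracking the finite transducer state.
    state = 0
    emitted = []
    for symbol in history:
        emitted.append(REWARDS[(state, symbol)])
        state = TRANSITIONS[(state, symbol)]
    return emitted

rewards = rewards_for_history(["b", "a", "b", "b"])
# rewards == [0.0, 0.0, 1.0, 1.0]
```

The reward for `b` is not Markov in the last symbol alone, yet a finite amount of memory (the transducer state) is enough to compute it, which is the sense in which the dependence on history is regular.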

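This paper explores how to carry out reinforcement learning in regular decision processes and shows that a near-optimal policy can be PAC-learned in time polynomial in a set of parameters describing the process.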