One of the common ways children learn is by mimicking adults. Imitation
learning focuses on learning policies with suitable performance from
demonstrations generated by an expert, when the performance measure is
unspecified and the reward signal is unobserved. Popular methods for imitation learning start by
either directly mimicking the behavior policy of an expert (behavior cloning)
or by learning a reward function that prioritizes observed expert trajectories
(inverse reinforcement learning). However, these methods rely on the assumption
that the covariates used by the expert to determine their actions are fully
observed. In this paper, we relax this assumption and study imitation learning
when sensory inputs of the learner and the expert differ. First, we provide a
non-parametric, graphical criterion that is complete (both necessary and
sufficient) for determining the feasibility of imitation from the combinations
of demonstration data and qualitative assumptions about the underlying
environment, represented in the form of a causal model. We then show that when
such a criterion does not hold, imitation could still be feasible by exploiting
quantitative knowledge of the expert trajectories. Finally, we develop an
efficient procedure for learning the imitating policy from experts'
trajectories.
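
For reference, the behavior cloning baseline mentioned above can be sketched in tabular form. This is only a minimal illustration, not the paper's procedure: it assumes discrete covariates that the learner fully observes, which is exactly the assumption the paper relaxes. The function name and data layout are hypothetical.

```python
from collections import Counter, defaultdict


def behavior_cloning(demonstrations):
    """Estimate pi(action | observation) by empirical frequencies.

    `demonstrations` is an iterable of (observation, action) pairs.
    Hypothetical helper for illustration only; it assumes discrete, hashable
    values and that the learner sees every covariate the expert used.
    """
    counts = defaultdict(Counter)
    for observation, action in demonstrations:
        counts[observation][action] += 1

    policy = {}
    for observation, action_counts in counts.items():
        total = sum(action_counts.values())
        # Normalize the counts into a conditional distribution over actions.
        policy[observation] = {a: c / total for a, c in action_counts.items()}
    return policy


# Toy demonstration data: observation 0 mostly leads to "left".
demos = [(0, "left"), (0, "left"), (0, "right"), (1, "right")]
policy = behavior_cloning(demos)
```

When the expert acts on a covariate that is missing from `demonstrations`, these frequencies need not yield a policy that matches the expert's performance, which is the setting the abstract addresses.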

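For imitation learning, the authors propose a non-parametric graphical criterion for determining whether imitation is feasible, and develop an efficient procedure for learning the imitating policy from expert trajectories.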

Causal Imitation Learning with Unobserved Confounders

We study an extension of the standard bandit problem in which there are R layers
of experts. Multi-layered experts make selections layer by layer and only the
experts in the last layer can play arms. The goal of the learning policy is to
minimize the total regret in this hierarchical experts setting. We first
analyze the case in which the total regret grows linearly with the number of
layers. Then we focus on the case in which all experts play the Upper
Confidence Bound (UCB) strategy and give several sub-linear upper bounds for
different circumstances. Finally, we design experiments to support the regret
analysis for the general case of the hierarchical UCB structure and to show the
practical significance of our theoretical results. This article offers insights
into designing reasonable hierarchical decision structures.
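
To make the UCB strategy concrete, the following is a minimal single-layer sketch, not the hierarchical structure analyzed in the paper. The abstract does not say which UCB variant the experts play; the code assumes the classical UCB1 index (empirical mean plus sqrt(2 ln t / n)), and the function names and the `pull` callback are hypothetical.

```python
import math


def ucb1_select(counts, reward_sums, t):
    """Return the arm chosen by UCB1 at round t (t starts at 1).

    Assumption: classical UCB1 index, mean + sqrt(2 * ln(t) / n).
    """
    # Play every arm once before using the confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    indices = [
        reward_sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=indices.__getitem__)


def run_ucb1(pull, num_arms, horizon):
    """Run UCB1 for `horizon` rounds; `pull(arm)` returns a reward in [0, 1]."""
    counts = [0] * num_arms
    reward_sums = [0.0] * num_arms
    for t in range(1, horizon + 1):
        arm = ucb1_select(counts, reward_sums, t)
        reward = pull(arm)
        counts[arm] += 1
        reward_sums[arm] += reward
    return counts


# Deterministic toy rewards so the run is reproducible.
means = [0.2, 0.8]
counts = run_ucb1(lambda arm: means[arm], num_arms=2, horizon=200)
```

In the hierarchical setting, each expert in a layer would presumably run such a selection rule over the experts (or arms) of the next layer, but that wiring is not specified in the abstract and is left out here.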

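This paper studies an extension of the standard bandit problem with R layers of experts. The experts make selections layer by layer, and only the experts in the last layer can play arms. The goal of the learning policy is to minimize the total regret in this hierarchical experts setting. The paper first analyzes the case in which the total regret grows linearly with the number of layers. It then focuses on the case in which all experts follow the Upper Confidence Bound (UCB) strategy and gives several sub-linear upper bounds for different circumstances. Finally, experiments support the regret analysis of the hierarchical UCB structure and show the practical significance of the theoretical results.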

Regret Analysis for Hierarchical Experts Bandit Problem

We study the online restless bandit problem, where the state of each arm
evolves according to a Markov chain, and the reward of pulling an arm depends
on both the pulled arm and the current state of the corresponding Markov chain.
In this paper, we propose Restless-UCB, a learning policy that follows the
explore-then-commit framework. In Restless-UCB, we present a novel method to
construct offline instances, which only requires $O(N)$ time complexity ($N$ is
the number of arms) and is exponentially better than the complexity of existing
learning policies. We also prove that Restless-UCB achieves a regret upper bound
of $\tilde{O}((N+M^3)T^{2\over 3})$, where $M$ is the Markov chain state space
size and $T$ is the time horizon. Compared to existing algorithms, our result
eliminates the exponential factor (in $M,N$) in the regret upper bound, due to
a novel exploitation of the sparsity in transitions in general restless bandit
problems. As a result, our analysis technique can also be adopted to tighten
the regret bounds of existing algorithms. Finally, we conduct experiments based
on a real-world dataset to compare the Restless-UCB policy with state-of-the-art
benchmarks. Our results show that Restless-UCB outperforms existing algorithms
in regret, and significantly reduces the running time.
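
The explore-then-commit framework that Restless-UCB follows can be illustrated in its simplest form. The sketch below is the textbook version for a stationary bandit, not Restless-UCB itself: it ignores the Markov state of each arm and the offline-instance construction described above. The function name, the `pull` callback, and the parameter `pulls_per_arm` are hypothetical.

```python
def explore_then_commit(pull, num_arms, horizon, pulls_per_arm):
    """Textbook explore-then-commit for a stationary bandit.

    Phase 1 pulls every arm `pulls_per_arm` times; phase 2 commits to the
    arm with the best empirical mean for the remaining rounds.
    Illustration only: Restless-UCB additionally has to handle Markov states.
    """
    if num_arms * pulls_per_arm > horizon:
        raise ValueError("exploration phase is longer than the horizon")

    reward_sums = [0.0] * num_arms
    total_reward = 0.0

    # Exploration phase: round-robin over the arms.
    for _ in range(pulls_per_arm):
        for arm in range(num_arms):
            reward = pull(arm)
            reward_sums[arm] += reward
            total_reward += reward

    # Commit phase: every arm has the same count, so comparing sums is
    # equivalent to comparing empirical means.
    best_arm = max(range(num_arms), key=reward_sums.__getitem__)
    for _ in range(horizon - num_arms * pulls_per_arm):
        total_reward += pull(best_arm)

    return best_arm, total_reward


# Deterministic toy rewards so the run is reproducible.
means = [0.1, 0.5, 0.3]
best_arm, total_reward = explore_then_commit(
    lambda arm: means[arm], num_arms=3, horizon=100, pulls_per_arm=5
)
```

In this textbook setting, choosing the exploration length on the order of $T^{2/3}$ typically yields regret of order $T^{2/3}$ up to logarithmic factors, which matches the dependence on $T$ in the bound stated above.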

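The paper proposes Restless-UCB, an online learning policy for the restless bandit problem that follows the explore-then-commit framework, using an initial exploration phase to make better decisions afterwards. It proves a regret upper bound of $\tilde{O}((N+M^3)T^{2\over 3})$ and, compared with existing algorithms, eliminates the exponential factor in the regret bound through a new way of exploiting the sparsity of state transitions; the same analysis technique can also be used to tighten the regret bounds of existing algorithms.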