Generative Adversarial Imitation Learning (GAIL) is a powerful and practical
approach for learning sequential decision-making policies. Unlike
Reinforcement Learning (RL), GAIL takes advantage of demonstration data from
experts (e.g., humans), and learns both the policy and the reward function of
the unknown environment. Despite significant empirical progress, the theory
behind GAIL is still largely unknown. The major difficulty comes from the
underlying temporal dependency of the demonstration data and the minimax
computational formulation of GAIL without convex-concave structure. To bridge
such a gap between theory and practice, this paper investigates the theoretical
properties of GAIL. Specifically, we show: (1) for GAIL with general reward
parameterization, generalization can be guaranteed as long as the class of
reward functions is properly controlled; (2) when the reward is parameterized
as a reproducing kernel function, GAIL can be solved efficiently by stochastic
first-order optimization algorithms, which attain sublinear convergence to a
stationary solution. To the best of our knowledge, these are
the first results on statistical and computational guarantees of imitation
learning with reward/policy function approximation. Numerical experiments are
provided to support our analysis.

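This paper investigates the theoretical properties of Generative Adversarial Imitation Learning (GAIL). It proves that, for a general reward parameterization, generalization can be guaranteed as long as the class of reward functions is properly controlled, and that when the reward is parameterized as a reproducing kernel function, GAIL can be solved efficiently by stochastic first-order optimization algorithms with sublinear convergence. The authors describe these as the first results on statistical and computational guarantees for imitation learning with reward/policy function approximation.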

On Computation and Generalization of Generative Adversarial Imitation Learning
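
The minimax formulation described in the abstract above can be illustrated with a small sketch of stochastic gradient descent-ascent. This is not the paper's algorithm. It assumes a one-step problem with a handful of discrete actions (so there is no temporal dependency), a softmax policy, and a tabular reward `w[a]`, which is linear in one-hot features and stands in for a kernel-parameterized reward. The function name `toy_gail`, the step sizes, and the ridge penalty `reg` are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def toy_gail(expert_probs, n_iters=3000, batch=128, lr_policy=0.5,
             lr_reward=0.2, reg=1.0, seed=0):
    """Stochastic gradient descent-ascent on a toy GAIL-style objective.

    Objective (an assumption of this sketch, not taken from the paper):
        min_theta max_w  E_expert[w[a]] - E_policy[w[a]] - (reg / 2) * ||w||^2
    It is concave in the reward w but non-convex in the policy logits theta.
    """
    rng = np.random.default_rng(seed)
    expert_probs = np.asarray(expert_probs, dtype=float)
    k = expert_probs.size
    theta = np.zeros(k)   # policy logits
    w = np.zeros(k)       # tabular reward, one value per action
    eye = np.eye(k)       # one-hot feature map

    for _ in range(n_iters):
        p = softmax(theta)
        a_exp = rng.choice(k, size=batch, p=expert_probs)  # expert demonstrations
        a_pol = rng.choice(k, size=batch, p=p)             # samples from the policy

        # Gradient of the objective with respect to w (used for the ascent step).
        grad_w = eye[a_exp].mean(axis=0) - eye[a_pol].mean(axis=0) - reg * w

        # Score-function estimate of d/dtheta E_policy[w[a]];
        # for a softmax policy, grad log pi(a) = onehot(a) - p.
        score = eye[a_pol] - p
        grad_theta = (w[a_pol][:, None] * score).mean(axis=0)

        # Simultaneous update: the reward ascends on the objective, while the
        # policy ascends on its expected reward (descent on the objective).
        w = w + lr_reward * grad_w
        theta = theta + lr_policy * grad_theta

    return softmax(theta), w


if __name__ == "__main__":
    expert = np.array([0.7, 0.2, 0.1])
    policy, reward = toy_gail(expert)
    print("learned policy:", np.round(policy, 3))
    print("learned reward:", np.round(reward, 3))
```

In this toy setting the inner maximization has the closed form `w = (expert_probs - policy_probs) / reg`, so the reward shrinks toward zero as the policy's action distribution approaches the expert's. That makes it easy to check by eye whether the iterates are heading toward a stationary point.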

We study how to effectively leverage expert feedback to learn sequential
decision-making policies. We focus on problems with sparse rewards and long
time horizons, which typically pose significant challenges in reinforcement
learning. We propose an algorithmic framework, called hierarchical guidance,
that leverages the hierarchical structure of the underlying problem to
integrate different modes of expert interaction. Our framework can incorporate
different combinations of imitation learning (IL) and reinforcement learning
(RL) at different levels, leading to dramatic reductions in both expert effort
and cost of exploration. Using long-horizon benchmarks, including Montezuma's
Revenge, we demonstrate that our approach can learn significantly faster than
hierarchical RL, and be significantly more label-efficient than standard IL. We
also theoretically analyze labeling cost for certain instantiations of our
framework.

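The paper proposes an algorithmic framework, called hierarchical guidance, that leverages expert feedback to learn sequential decision-making policies in problems with sparse rewards and long time horizons. The framework can combine different forms of imitation learning and reinforcement learning at different levels, substantially reducing expert effort and exploration cost. The paper also theoretically analyzes the labeling cost for certain instantiations of the framework.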