In generative adversarial imitation learning (GAIL), the agent aims to learn
a policy from an expert demonstration so that its performance cannot be
discriminated from the expert policy on a certain predefined reward set. In
this paper, we study GAIL in both online and offline settings with linear
function approximation, where both the transition and reward function are
linear in the feature maps. Besides the expert demonstration, in the online
setting the agent can interact with the environment, while in the offline
setting the agent only accesses an additional dataset collected by a prior. For
online GAIL, we propose an optimistic generative adversarial policy
optimization algorithm (OGAP) and prove that OGAP achieves
$\widetilde{\mathcal{O}}(H^2 d^{3/2}K^{1/2}+KH^{3/2}dN_1^{-1/2})$ regret. Here
$N_1$ represents the number of trajectories of the expert demonstration, $d$ is
the feature dimension, and $K$ is the number of episodes.
For offline GAIL, we propose a pessimistic generative adversarial policy
optimization algorithm (PGAP). For an arbitrary additional dataset, we obtain
the optimality gap of PGAP, achieving the minimax lower bound in the
utilization of the additional dataset. Assuming sufficient coverage on the
additional dataset, we show that PGAP achieves
$\widetilde{\mathcal{O}}(H^{2}dK^{-1/2}
+H^2d^{3/2}N_2^{-1/2}+H^{3/2}dN_1^{-1/2} \ )$ optimality gap. Here $N_2$
represents the number of trajectories of the additional dataset with sufficient
coverage.

本文研究了在线和离线线性情况下生成对抗模仿学习，提出了乐观和悲观的生成对抗策略优化算法，并证明了算法的收敛性和误差界。