Adversarial imitation learning (AIL) has stood out as a dominant framework
across various imitation learning (IL) applications, with Discriminator Actor
Critic (DAC) (Kostrikov et al., 2019) demonstrating the effectiveness of
off-policy learning algorithms in improving sample efficiency and scalability
to higher-dimensional observations. Despite DAC's empirical success, the
original AIL objective is on-policy and DAC's ad-hoc application of off-policy
training does not guarantee successful imitation (Kostrikov et al., 2019;
2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this
issue by deriving a fully off-policy AIL objective. In this work, we instead
develop a novel and principled AIL algorithm via the framework of boosting.
Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly
weighted weak learners (i.e., policies) and trains a discriminator that
witnesses the maximum discrepancy between the distributions of the ensemble and
the expert policy. We maintain a weighted replay buffer to represent the
state-action distribution induced by the ensemble, allowing us to train
discriminators using all the data collected so far. In the weighted replay
buffer, the contribution of data from older policies is properly discounted,
with weights computed from the boosting framework.
Empirically, we evaluate our algorithm on both controller state-based and
pixel-based environments from the DeepMind Control Suite. AILBoost outperforms
DAC on both types of environments, demonstrating the benefit of properly
weighting replay buffer data for off-policy training. On state-based
environments, AILBoost outperforms ValueDICE and IQ-Learn (Garg et al., 2021),
achieving competitive performance with as little as one expert trajectory.
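
The weighted replay buffer described above can be sketched as follows. This is
a minimal illustration, not the authors' implementation: the class
`WeightedReplayBuffer`, its method names, and the geometric reweighting with a
fixed mixing weight `alpha` are assumptions made for the example. The abstract
states only that data from older policies are discounted by weights derived
from the boosting framework.

```python
import random


class WeightedReplayBuffer:
    """Per-policy buckets of (state, action) pairs with mixture weights.

    Assumption for illustration: each new policy enters the mixture with
    weight `alpha`, and all earlier weights are scaled by (1 - alpha).
    """

    def __init__(self, alpha, seed=0):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.buckets = []  # one list of transitions per policy
        self.weights = []  # one mixture weight per policy, summing to 1
        self.rng = random.Random(seed)

    def add_policy_data(self, transitions):
        """Store data from the newest policy and discount older policies."""
        transitions = list(transitions)
        if not transitions:
            raise ValueError("transitions must be non-empty")
        if not self.buckets:
            self.weights = [1.0]
        else:
            self.weights = [w * (1.0 - self.alpha) for w in self.weights]
            self.weights.append(self.alpha)
        self.buckets.append(transitions)

    def sample(self, batch_size):
        """Pick a policy by weight, then a transition uniformly within it."""
        policy_ids = self.rng.choices(
            range(len(self.buckets)), weights=self.weights, k=batch_size
        )
        return [self.rng.choice(self.buckets[i]) for i in policy_ids]


if __name__ == "__main__":
    buffer = WeightedReplayBuffer(alpha=0.5)
    for policy_id in range(3):
        # Placeholder transitions: (state, action) tagged with the policy id.
        buffer.add_policy_data([(policy_id, step) for step in range(4)])
    print(buffer.weights)
    print(buffer.sample(5))
```

Sampling a policy first and a transition second means a discriminator batch
drawn this way follows the ensemble weights rather than the number of
transitions each policy happened to contribute.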

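By introducing AILBoost, a new algorithm built on a weighted replay buffer, this paper studies the effectiveness of adversarial imitation learning under off-policy training; experiments show that AILBoost outperforms DAC on both controller state-based and pixel-based environments.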

Adversarial Imitation Learning via Boosting

Baird's counterexample was proposed by Leemon Baird in 1995 and was first used
to show that the temporal-difference algorithm TD(0) diverges on this example.
Since then, it has often been used to test and compare off-policy learning
algorithms. Gradient TD algorithms solved the divergence issue of TD on
Baird's counterexample. However, their convergence on this example is still
very slow, and the nature of the slowness is not well understood; see, e.g.,
Sutton and Barto (2018).
This note aims to understand, in particular, why TDC is slow on this example,
and provides a debugging analysis of this behavior. Our debugging technique
can be used to study the convergence behavior of two-time-scale stochastic
approximation algorithms. We also provide empirical results for the recent
Impression GTD algorithm on this example, showing that convergence is very
fast, in fact at a linear rate. We conclude that Baird's counterexample is
solved by an algorithm with a convergence guarantee to the TD solution in
general and a fast convergence rate.
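
To make the setting concrete, the sketch below runs expected (synchronous,
sweep-based) versions of semi-gradient TD(0) and TDC on Baird's
counterexample. It assumes the seven-state, eight-weight formulation and the
initial weights given in Sutton and Barto (2018), a discount of 0.99, zero
rewards, and step sizes picked for illustration. The function names are
hypothetical, and the code is neither the note's debugging analysis nor
Impression GTD.

```python
import math

GAMMA = 0.99
N_STATES, N_WEIGHTS = 7, 8


def features():
    """Feature rows: v(s) = 2*w_s + w_8 for s = 1..6, v(7) = w_7 + 2*w_8."""
    rows = []
    for s in range(6):
        x = [0.0] * N_WEIGHTS
        x[s], x[7] = 2.0, 1.0
        rows.append(x)
    last = [0.0] * N_WEIGHTS
    last[6], last[7] = 1.0, 2.0
    rows.append(last)
    return rows


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def initial_weights():
    return [1.0] * 6 + [10.0, 1.0]


def td_errors(X, w):
    # The target policy always moves to state 7 and every reward is zero.
    target = GAMMA * dot(X[6], w)
    return [target - dot(x, w) for x in X]


def expected_td0(sweeps, alpha=0.01):
    """Expected semi-gradient TD(0), states weighted uniformly."""
    X, w = features(), initial_weights()
    for _ in range(sweeps):
        deltas = td_errors(X, w)
        w = [w[j] + alpha / N_STATES * sum(d * x[j] for d, x in zip(deltas, X))
             for j in range(N_WEIGHTS)]
    return w


def expected_tdc(sweeps, alpha=0.005, beta=0.05):
    """Expected TDC with primary weights w and secondary weights v."""
    X, w = features(), initial_weights()
    v = [0.0] * N_WEIGHTS
    for _ in range(sweeps):
        deltas = td_errors(X, w)
        xv = [dot(x, v) for x in X]
        # Both updates use the values of w and v from the start of the sweep.
        step_w = [sum(d * x[j] - GAMMA * X[6][j] * p
                      for d, x, p in zip(deltas, X, xv)) / N_STATES
                  for j in range(N_WEIGHTS)]
        step_v = [sum((d - p) * x[j] for d, x, p in zip(deltas, X, xv)) / N_STATES
                  for j in range(N_WEIGHTS)]
        w = [wj + alpha * g for wj, g in zip(w, step_w)]
        v = [vj + beta * g for vj, g in zip(v, step_v)]
    return w


def rms_value_error(w):
    """True values are all zero here, so the error is the RMS of X w."""
    X = features()
    return math.sqrt(sum(dot(x, w) ** 2 for x in X) / N_STATES)


if __name__ == "__main__":
    for name, w in [("initial", initial_weights()),
                    ("TD(0)", expected_td0(1000)),
                    ("TDC", expected_tdc(1000))]:
        print(name, norm(w), rms_value_error(w))
```

Under these assumptions the TD(0) weight norm grows with the number of sweeps,
while the TDC weights stay bounded. Printing `rms_value_error` for several
sweep counts is one way to see how quickly each method approaches the true
values.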

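For Baird's counterexample, this work addresses the divergence of the TD algorithm and the slow convergence on this problem, and presents an algorithm with a convergence guarantee and a fast convergence rate that solves Baird's counterexample.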