Adversarial imitation learning (AIL) has stood out as a dominant framework
across various imitation learning (IL) applications, with Discriminator Actor
Critic (DAC) (Kostrikov et al., 2019) demonstrating the effectiveness of
off-policy learning algorithms in improving sample efficiency and scalability
to higher-dimensional observations. Despite DAC's empirical success, the
original AIL objective is on-policy and DAC's ad-hoc application of off-policy
training does not guarantee successful imitation (Kostrikov et al., 2019;
2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this
issue by deriving a fully off-policy AIL objective. In this work, we instead
develop a novel and principled AIL algorithm via the framework of boosting.
Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly
weighted weak learners (i.e., policies) and trains a discriminator that
witnesses the maximum discrepancy between the distributions of the ensemble and
the expert policy. We maintain a weighted replay buffer to represent the
state-action distribution induced by the ensemble, allowing us to train
discriminators using the entire data collected so far. In the weighted replay
buffer, the contribution of the data from older policies is properly
discounted, with the weights computed from the boosting framework.
Empirically, we evaluate our algorithm on both controller state-based and
pixel-based environments from the DeepMind Control Suite. AILBoost outperforms
DAC on both types of environments, demonstrating the benefit of properly
weighting replay buffer data for off-policy training. On state-based
environments, AILBoost outperforms ValueDICE and IQ-Learn (Garg et al., 2021),
achieving competitive performance with as little as one expert trajectory.
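
The weighted replay buffer described above can be illustrated with a minimal
sketch. The abstract does not state the weighting rule, so the geometric
discount below (old weights scaled by 1 - alpha, the newest policy given
weight alpha) is an assumption, as are the names WeightedReplayBuffer,
add_policy_data, and alpha. It is not the authors' implementation.

```python
import random


class WeightedReplayBuffer:
    """Hypothetical sketch: one data shard per ensemble policy, plus mixture weights."""

    def __init__(self, alpha=0.5, seed=0):
        self.alpha = alpha    # assumed mixing step size, not taken from the paper
        self.shards = []      # shards[i] holds the transitions collected by policy i
        self.weights = []     # weights[i] is the mixture weight of policy i (sums to 1)
        self.rng = random.Random(seed)

    def add_policy_data(self, transitions):
        """Register data from a new policy and discount all older policies."""
        transitions = list(transitions)
        if not transitions:
            raise ValueError("each policy must contribute at least one transition")
        if not self.shards:
            self.weights = [1.0]
        else:
            # Assumed rule: shrink old weights, give the newest policy weight alpha.
            self.weights = [w * (1.0 - self.alpha) for w in self.weights]
            self.weights.append(self.alpha)
        self.shards.append(transitions)

    def sample(self, batch_size):
        """Draw a batch: pick a policy by its weight, then a transition uniformly."""
        shard_ids = self.rng.choices(
            range(len(self.shards)), weights=self.weights, k=batch_size
        )
        return [self.rng.choice(self.shards[i]) for i in shard_ids]


buf = WeightedReplayBuffer(alpha=0.5)
buf.add_policy_data([("s0", "a0")])
buf.add_policy_data([("s1", "a1")])
buf.add_policy_data([("s2", "a2")])
batch = buf.sample(8)
```

Under this assumed rule, data from every past policy stays available to the
discriminator, while its sampling probability decays as newer policies join
the ensemble.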

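By introducing AILBoost, a new algorithm built on a weighted replay buffer, this paper studies the effectiveness of adversarial imitation learning under off-policy training; experiments show that AILBoost outperforms DAC on both controller state-based and pixel-based environments.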

Adversarial Imitation Learning via Boosting

Computational units in artificial neural networks compute a linear
combination of their inputs and then apply a nonlinear filter, often a ReLU
shifted by some bias. If the inputs themselves come from other units, they
have already been filtered with those units' own biases. In a layer, multiple
units share the same inputs, and each input was filtered with a single bias,
so output values are based on shared input biases rather than individually
optimal ones. To mitigate this issue, we introduce DAC (Dendrites-Activated
Connections), a new computational unit based
on preactivation and multiple biases, where input signals undergo independent
nonlinear filtering before the linear combination. We provide a Keras
implementation and report its computational efficiency. We test DAC
convolutions in ResNet architectures on CIFAR-10, CIFAR-100, Imagenette, and
Imagewoof, and achieve performance improvements of up to 1.73%. We exhibit
examples where DAC is more efficient than its standard counterpart as a
function approximator, and we prove a universal representation theorem.
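
The difference between a standard unit and a multi-bias preactivation unit
can be sketched in a few lines of plain Python. This is only an illustration
of the idea stated in the abstract (each input is filtered with its own bias
before the linear combination). The function names, the additive sign
convention for the bias, and the example values are assumptions, and this is
not the Keras implementation mentioned above.

```python
def relu(v):
    """Rectified linear unit on a single float."""
    return max(0.0, v)


def standard_layer(x, W, b):
    """Standard units: y_j = relu(sum_i W[j][i] * x[i] + b[j])."""
    return [
        relu(sum(w * xi for w, xi in zip(row, x)) + bj)
        for row, bj in zip(W, b)
    ]


def multi_bias_layer(x, W, B):
    """Assumed multi-bias form: y_j = sum_i W[j][i] * relu(x[i] + B[j][i]).

    Each (unit j, input i) connection has its own bias B[j][i], so the
    nonlinear filtering happens per connection, before the weighted sum.
    """
    return [
        sum(w * relu(xi + bji) for w, xi, bji in zip(row, x, brow))
        for row, brow in zip(W, B)
    ]


x = [1.0, -2.0]
W = [[1.0, 1.0], [2.0, -1.0]]
B = [[0.0, 3.0], [-2.0, 0.0]]   # one bias per connection (hypothetical values)
y_multi = multi_bias_layer(x, W, B)
y_std = standard_layer(x, W, [0.0, 0.0])
```

In the standard layer a unit has one bias applied after the sum, whereas in
the sketch above the bias matrix B has the same shape as the weight matrix W.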

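This study introduces DAC, a new computational unit based on preactivation and multiple biases, to mitigate the problem of multiple units in a neural network sharing input biases; DAC convolutions tested in ResNet architectures achieve performance improvements of up to 1.73%.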

Improving Performance in Neural Networks by Dendrites-Activated Connections

The age of information metric fails to correctly describe the intrinsic
semantics of a status update. In an intelligent reflecting surface-aided
cooperative relay communication system, we propose the age of semantics (AoS)
for measuring the semantic freshness of status updates. Specifically, we focus
on status updating from a source node (SN) to the destination, which is
formulated as a Markov decision process (MDP). The objective of the SN is to
maximize the expected satisfaction of AoS and energy consumption under the
maximum transmit power constraint. To seek the optimal control policy, we first
derive an online deep actor-critic (DAC) learning scheme under the on-policy
temporal difference learning framework. However, implementing the online DAC in
practice requires infinitely repeated interactions between the SN and the
system, which can be dangerous, particularly during exploration. We therefore
put forward a novel offline DAC scheme, which estimates
the optimal control policy from a previously collected dataset without any
further interactions with the system. Numerical experiments verify the
theoretical results and show that our offline DAC scheme significantly
outperforms the online DAC scheme and the most representative baselines in
terms of mean utility, demonstrating strong robustness to dataset quality.
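
For readers unfamiliar with on-policy temporal difference actor-critic
learning, the following is a minimal tabular sketch of one generic textbook
update step. It is not the deep or offline scheme proposed above: the
tabular value list V, the softmax preference table prefs, the step sizes,
and the discount gamma are all illustrative assumptions.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(V, prefs, s, a, r, s_next, gamma=0.9, lr_v=0.1, lr_pi=0.1):
    """One on-policy TD(0) actor-critic update; mutates V and prefs in place."""
    # Critic: one-step temporal difference error.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += lr_v * td_error
    # Actor: policy-gradient step for a softmax policy, scaled by the TD error.
    probs = softmax(prefs[s])
    for i in range(len(probs)):
        grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
        prefs[s][i] += lr_pi * td_error * grad_log_pi
    return td_error


V = [0.0, 0.0]                       # state values for two hypothetical states
prefs = [[0.0, 0.0], [0.0, 0.0]]     # action preferences per state
delta = actor_critic_step(V, prefs, s=0, a=1, r=1.0, s_next=1)
```

Each such update needs a fresh transition from the running system, which is
the kind of repeated interaction that an offline scheme trained on a fixed
dataset avoids.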

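This paper proposes the age of semantics (AoS), a metric for the semantic freshness of status updates, studies the optimal control policy for status updating between a source node and the destination, and presents online and offline deep actor-critic algorithms; the offline algorithm shows strong robustness to dataset quality.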