Quality-Diversity (QD) algorithms have emerged as a powerful optimization
paradigm with the aim of generating a set of high-quality and diverse
solutions. To achieve such a challenging goal, QD algorithms require
maintaining a large archive and a large population in each iteration, which
raises two main issues: sample efficiency and resource efficiency. Most advanced QD
algorithms focus on improving sample efficiency, while resource
efficiency is overlooked to some extent. In particular, the resource overhead
during training has not yet been addressed, hindering the wider
application of QD algorithms. In this paper, we highlight this important
research question, i.e., how to efficiently train QD algorithms with limited
resources, and propose a novel and effective method called RefQD to address it.
RefQD decomposes a neural network into representation and decision parts, and
shares the representation part with all decision parts in the archive to reduce
the resource overhead. It also employs a series of strategies to address the
mismatch issue between the old decision parts and the newly updated
representation part. Experiments on different types of tasks from small to
large resource consumption demonstrate the excellent performance of RefQD: it
not only uses significantly fewer resources (e.g., 16% of the GPU memory on QDax
and 3.7% on Atari) but also achieves comparable or better performance compared
to sample-efficient QD algorithms. Our code is available at
https://github.com/lamda-bbo/RefQD.

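This paper addresses how to train Quality-Diversity (QD) algorithms efficiently with limited resources. It proposes a new method called RefQD, which decomposes a neural network into a representation part and a decision part and shares the representation part across the archive to reduce resource overhead. Experiments on tasks ranging from small to large resource consumption demonstrate RefQD's strong performance.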
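
The decomposition described above can be illustrated with a minimal sketch. It is not the RefQD implementation: the class `SharedRepresentationArchive`, the single-layer representation, the linear decision heads, and all dimensions are hypothetical choices made only to show why storing one shared representation plus many small decision parts takes far fewer parameters than storing a full network per archive cell. The strategies RefQD uses to handle the mismatch between old decision parts and an updated representation are omitted here.

```python
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Matrix-vector product for list-based matrices."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


def num_params(matrix):
    return len(matrix) * len(matrix[0])


class SharedRepresentationArchive:
    """Hypothetical archive: one shared representation, one decision head per cell."""

    def __init__(self, obs_dim, feat_dim, act_dim, seed=0):
        self.rng = random.Random(seed)
        self.feat_dim = feat_dim
        self.act_dim = act_dim
        # The representation part is stored once and shared by every solution.
        self.representation = make_matrix(feat_dim, obs_dim, self.rng)
        # Each archive cell keeps only its own (small) decision part.
        self.decision_heads = {}

    def add_solution(self, cell):
        self.decision_heads[cell] = make_matrix(self.act_dim, self.feat_dim, self.rng)

    def act(self, cell, obs):
        features = relu(matvec(self.representation, obs))
        return matvec(self.decision_heads[cell], features)

    def total_params(self):
        heads = sum(num_params(h) for h in self.decision_heads.values())
        return num_params(self.representation) + heads


def full_network_params(obs_dim, feat_dim, act_dim, n_solutions):
    """Parameter count if every solution stored its own full network."""
    return n_solutions * (feat_dim * obs_dim + act_dim * feat_dim)


archive = SharedRepresentationArchive(obs_dim=64, feat_dim=32, act_dim=4)
for cell in range(100):
    archive.add_solution(cell)

print("shared:", archive.total_params())                      # 2048 + 100 * 128
print("separate:", full_network_params(64, 32, 4, 100))       # 100 * (2048 + 128)
```

In this toy setting the shared layout stores 14,848 parameters against 217,600 for 100 independent networks. The saving grows with the size of the representation part relative to the decision part, which is consistent with the larger reduction the abstract reports on Atari than on QDax.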

Quality-Diversity with Limited Resources

Bandit problems with linear or concave reward have been extensively studied,
but relatively few works have studied bandits with non-concave reward. This
work considers a large family of bandit problems where the unknown underlying
reward function is non-concave, including low-rank generalized linear
bandit problems and the two-layer neural network bandit problem with polynomial
activation. For the low-rank generalized linear bandit problem, we provide a
minimax-optimal algorithm in the dimension, refuting both conjectures in
[LMT21, JWWN19]. Our algorithms are based on a unified zeroth-order
optimization paradigm that applies in great generality and attains optimal
rates in several structured polynomial settings (in the dimension). We further
demonstrate the applicability of our algorithms in RL in the generative model
setting, resulting in improved sample complexity over prior approaches.
Finally, we show that the standard optimistic algorithms (e.g., UCB) are
sub-optimal by dimension factors. In the neural net setting (with polynomial
activation functions) with noiseless reward, we provide a bandit algorithm with
sample complexity equal to the intrinsic algebraic dimension. Again, we show
that optimistic approaches have worse sample complexity, polynomial in the
extrinsic dimension (which could be exponentially worse in the polynomial
degree).

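This paper studies bandit problems with non-concave rewards. It proposes algorithms for a family of bandits whose reward functions are non-concave. Through a unified zeroth-order optimization paradigm, the algorithms attain optimal rates in structured polynomial settings, and applying them to RL in the generative model setting yields better sample complexity than prior approaches.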
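
To make "zeroth-order optimization" concrete, here is a minimal sketch of a generic method that improves an action using only reward queries, with no access to gradients. It is not the algorithm from the paper: the rank-one quadratic reward `(theta . x)^2`, the coordinate-wise finite-difference estimator, the step size, and the function names are all assumptions chosen for illustration. The reward is noiseless, and queries are made at points slightly off the unit sphere.

```python
import math


def reward(x, theta):
    """Hypothetical non-concave reward: squared inner product with a hidden theta."""
    return sum(t * v for t, v in zip(theta, x)) ** 2


def normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def zeroth_order_gradient(f, x, delta=1e-3):
    """Estimate the gradient from function values only (central differences)."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        plus[i] += delta
        minus = list(x)
        minus[i] -= delta
        grad.append((f(plus) - f(minus)) / (2.0 * delta))
    return grad


def zeroth_order_ascent(f, x0, steps=20, lr=0.5):
    """Projected ascent on the unit sphere, counting reward queries."""
    x = normalize(x0)
    queries = 0
    for _ in range(steps):
        grad = zeroth_order_gradient(f, x)
        queries += 2 * len(x)  # two reward queries per coordinate
        x = normalize([v + lr * g for v, g in zip(x, grad)])
    return x, queries


theta = [0.6, 0.8, 0.0, 0.0]          # hidden parameter, unknown to the learner
f = lambda x: reward(x, theta)        # the learner only ever calls f
x0 = [0.5, 0.5, 0.5, 0.5]

x_final, n_queries = zeroth_order_ascent(f, x0)
print("initial reward:", f(normalize(x0)))
print("final reward:", f(x_final))
print("queries used:", n_queries)
```

Each step costs two queries per coordinate, so the query count here scales with the ambient dimension. The abstract's claims concern how far structure in the reward can reduce that cost, which this naive estimator does not attempt.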