$Q$-learning algorithms are appealing for real-world applications due to
their data-efficiency, but they are very prone to overfitting and training
instabilities when trained from visual observations. Prior work, namely SVEA,
finds that selective application of data augmentation can improve the visual
generalization of RL agents without destabilizing training. We revisit its
recipe for data augmentation, and find an assumption that limits its
effectiveness to augmentations of a photometric nature. Addressing this
limitation, we propose a generalized recipe, SADA, that works with a wider
variety of augmentations. We benchmark its effectiveness on DMC-GB2 (our
proposed extension of the popular DMControl Generalization Benchmark) as well
as on tasks from Meta-World and the Distracting Control Suite, and find that our
method, SADA, greatly improves training stability and generalization of RL
agents across a diverse set of augmentations. Visualizations, code, and
benchmark: see this https URL
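
The abstract builds on SVEA's selective use of augmentation. The sketch below illustrates that baseline idea, not SADA itself. It assumes a discrete-action, DQN-style simplification of SVEA's critic update: the bootstrap target is computed only from the unaugmented next observation, and the prediction loss is applied to both the clean and the augmented current observation. The brightness shift is one example of the photometric augmentations mentioned above. All function names, the toy Q-function and the loss weights are hypothetical.

```python
import random


def random_brightness(obs, rng, max_delta=0.2):
    """Photometric augmentation: shift every pixel by the same random amount, clipped to [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return [min(1.0, max(0.0, p + delta)) for p in obs]


def selective_aug_critic_loss(q, q_target, obs, aug_obs, action, reward,
                              next_obs, gamma=0.99, alpha=0.5, beta=0.5):
    """SVEA-style critic loss for one transition (hypothetical discrete-action sketch).

    The target never sees augmented data: it uses the clean next observation.
    The squared TD error is then weighted over the clean (alpha) and the
    augmented (beta) versions of the current observation.
    """
    target = reward + gamma * max(q_target(next_obs))
    clean_err = q(obs)[action] - target
    aug_err = q(aug_obs)[action] - target
    return alpha * clean_err ** 2 + beta * aug_err ** 2


def toy_q(o):
    """Toy Q-function over two actions (hypothetical)."""
    return [sum(o), -sum(o)]


rng = random.Random(0)
print(random_brightness([0.2, 0.4], rng))
print(selective_aug_critic_loss(toy_q, toy_q, [1.0, 2.0], [1.5, 2.0],
                                action=0, reward=1.0, next_obs=[0.0, 1.0],
                                gamma=0.5))  # 3.125
```

The key design choice is that `next_obs` is never augmented, so the target stays stable. The abstract reports that SVEA's recipe is effective mainly for photometric augmentations and that SADA generalizes it to a wider variety; SADA's exact form is not reproduced here.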

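SADA, a generalized data augmentation recipe, improves the training stability and generalization of Q-learning algorithms trained from visual observations and works with a wide variety of data augmentations.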

A Recipe for Unbounded Data Augmentation in Visual Reinforcement Learning

This study introduces the cell load estimation problem that arises in cell
switching approaches for cellular networks, specifically in a
high-altitude platform station (HAPS)-assisted network. The problem arises from
the fact that the traffic loads of sleeping base stations for the next time
slot cannot be known perfectly and can only be estimated. Any
estimation error may cause a divergence from the optimal decision, which
in turn affects energy efficiency. The traffic loads of
the sleeping base stations for the next time slot are required because the
switching decisions are made proactively in the current time slot. Two
different Q-learning algorithms are developed: one is full-scale and focuses
solely on performance, while the other is lightweight and addresses
computational cost. Results confirm that the estimation error can
change cell switching decisions, which yields a performance divergence compared
to error-free scenarios. Moreover, the developed Q-learning algorithms perform
well: only an insignificant difference (0.3%) is observed between them
and the optimal algorithm.
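
The point that an estimation error can change a proactive switching decision is illustrated by the toy sketch below. The power model has hypothetical idle, load-proportional, sleep and HAPS-offloading terms, and its parameter values and the names `total_power` and `best_sleep_set` are hypothetical and not taken from the study. Exhaustive search stands in for the optimal decision, and the study's Q-learning algorithms are not reproduced.

```python
from itertools import combinations

# Hypothetical per-cell power model (arbitrary units), not from the study.
P_IDLE, P_LOAD, P_SLEEP, P_HAPS = 10.0, 5.0, 1.0, 2.0


def total_power(loads, sleeping):
    """Network power when the cells in `sleeping` are off and offloaded to the HAPS."""
    power = 0.0
    for i, load in enumerate(loads):
        if i in sleeping:
            power += P_SLEEP + P_HAPS * load  # sleep power + HAPS cost of the offloaded load
        else:
            power += P_IDLE + P_LOAD * load   # active cell serves its own load
    return power


def best_sleep_set(loads, haps_capacity):
    """Exhaustively pick which cells to sleep, subject to the HAPS capacity."""
    best, best_power = (), total_power(loads, ())
    for k in range(1, len(loads) + 1):
        for subset in combinations(range(len(loads)), k):
            if sum(loads[i] for i in subset) > haps_capacity:
                continue  # the HAPS cannot absorb this much offloaded traffic
            power = total_power(loads, subset)
            if power < best_power:
                best, best_power = subset, power
    return best


# The decision is made proactively from *estimated* next-slot loads.
true_loads = [0.2, 0.5]
estimated_loads = [0.2, 0.35]  # cell 1's load is underestimated
print(best_sleep_set(true_loads, 0.6))       # (1,)
print(best_sleep_set(estimated_loads, 0.6))  # (0, 1)
```

With the true loads, the search sleeps only cell 1. The underestimated load makes sleeping both cells look feasible, but under the true loads (total 0.7 > 0.6) that decision would overload the HAPS.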

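This study introduces the cell load estimation problem in cell switching for HAPS-assisted networks. Because switching decisions are made proactively, the next-slot loads of sleeping base stations must be estimated, and estimation errors can change switching decisions and cause performance divergence that affects energy efficiency. The developed Q-learning algorithms perform well, differing from the optimal algorithm by only 0.3%.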

Cell Switching in HAPS-Aided Networking: How the Obscurity of Traffic Loads Affects the Decision

There are only a few learning algorithms applicable to stochastic dynamic
teams and games, which generalize Markov decision processes to decentralized
stochastic control problems involving possibly self-interested decision makers.
Learning in games is generally difficult because of the non-stationary
environment in which each decision maker aims to learn its optimal decisions
with minimal information in the presence of the other decision makers who are
also learning. In stochastic dynamic games, learning is more challenging
because, while learning, the decision makers alter the state of the system and
hence the future cost. In this paper, we present decentralized Q-learning
algorithms for stochastic games, and study their convergence for the weakly
acyclic case which includes team problems as an important special case. The
algorithm is decentralized in that each decision maker has access to only its
local information, the state information, and the local cost realizations;
furthermore, it is completely oblivious to the presence of other decision
makers. We show that these algorithms converge to equilibrium policies almost
surely in large classes of stochastic games.
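
The sketch below shows one decision maker's side of decentralized Q-learning, under stated assumptions. It assumes finite state and action sets, a discounted cost, a step size equal to the inverse visit count, and baseline policies held fixed within an exploration phase. At the end of each phase, the policy moves to a `delta`-best reply with `inertia`. The class, its parameters and the end-of-phase reset of the Q-estimates are choices made for this illustration. The paper's exact algorithm and its convergence conditions are not reproduced.

```python
import random


class DecentralizedQAgent:
    """Hypothetical sketch of one decision maker in decentralized Q-learning.

    The agent uses only the state, its own action, and its own cost; other
    decision makers are never modelled and are simply part of the environment.
    """

    def __init__(self, states, actions, beta=0.9, rho=0.1,
                 delta=0.1, inertia=0.5, seed=0):
        self.states, self.actions = list(states), list(actions)
        self.beta = beta        # discount factor
        self.rho = rho          # exploration probability
        self.delta = delta      # tolerance when comparing Q-values
        self.inertia = inertia  # probability of keeping a non-best-reply action
        self.rng = random.Random(seed)
        self.q = {(x, u): 0.0 for x in self.states for u in self.actions}
        self.visits = {key: 0 for key in self.q}
        # Deterministic baseline policy, fixed within an exploration phase.
        self.policy = {x: self.actions[0] for x in self.states}

    def act(self, x):
        # Follow the baseline policy; explore uniformly with probability rho.
        if self.rng.random() < self.rho:
            return self.rng.choice(self.actions)
        return self.policy[x]

    def update(self, x, u, cost, x_next):
        # Q-learning step on local information (costs are minimized).
        self.visits[(x, u)] += 1
        alpha = 1.0 / self.visits[(x, u)]
        target = cost + self.beta * min(self.q[(x_next, v)] for v in self.actions)
        self.q[(x, u)] = (1 - alpha) * self.q[(x, u)] + alpha * target

    def end_phase(self):
        # Move toward a delta-best reply with inertia, then reset the estimates.
        for x in self.states:
            best = min(self.q[(x, u)] for u in self.actions)
            if self.q[(x, self.policy[x])] <= best + self.delta:
                continue  # current action is already a delta-best reply
            if self.rng.random() < self.inertia:
                continue  # inertia: keep the current action for now
            replies = [u for u in self.actions if self.q[(x, u)] <= best + self.delta]
            self.policy[x] = self.rng.choice(replies)
        self.q = {key: 0.0 for key in self.q}
        self.visits = {key: 0 for key in self.q}
```

In a game, each decision maker would run its own copy of this agent against a shared environment loop, which is not shown. Inertia keeps all agents from switching at once, which is the kind of coordination-free behaviour the abstract's weakly acyclic setting relies on.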

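This paper presents decentralized Q-learning algorithms for stochastic dynamic teams and games and studies their convergence in the weakly acyclic case, which includes team problems as an important special case.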