Training large neural networks is known to be time-consuming, often
taking days or even weeks. To address this problem,
large-batch optimization was introduced. This approach demonstrated that
scaling mini-batch sizes with appropriate learning rate adjustments can speed
up the training process by orders of magnitude. Long training times were
not typically a major issue for model-free deep offline RL algorithms, but
recently introduced Q-ensemble methods, which achieve state-of-the-art
performance, have made the issue more relevant by notably extending training
duration. In this work, we
demonstrate how this class of methods can benefit from large-batch
optimization, which is commonly overlooked by the deep offline RL community. We
show that scaling the mini-batch size and naively adjusting the learning rate
allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of
out-of-distribution actions, and (3) improved convergence time, effectively
shortening training duration by 3-4x on average.
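
The batch-size and learning-rate adjustment described above can be illustrated with a minimal sketch. The abstract does not say which scaling rule is used, so the two rules below (linear and square-root) are only common heuristics from the large-batch literature, and the function name and the example values are hypothetical.

```python
def scale_learning_rate(base_lr, base_batch, new_batch, rule="sqrt"):
    """Adjust a learning rate when the mini-batch size changes.

    Both rules are generic heuristics and are assumptions here,
    not the rule confirmed by the abstract.
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        # Linear scaling: learning rate grows proportionally to batch size.
        return base_lr * ratio
    if rule == "sqrt":
        # Square-root scaling: learning rate grows with sqrt of the ratio.
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical example: going from a batch of 256 to a batch of 1024.
lr_sqrt = scale_learning_rate(3e-4, 256, 1024, rule="sqrt")
lr_linear = scale_learning_rate(3e-4, 256, 1024, rule="linear")
```

With a 4x larger batch, the square-root rule doubles the learning rate while the linear rule quadruples it; which of the two (if either) suits a given Q-ensemble method is an empirical question.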

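This study explores large-batch optimization for deep offline reinforcement learning methods. It proposes that scaling the mini-batch size with an appropriate learning-rate adjustment markedly speeds up training, with positive effects on limiting the Q-ensemble size, strengthening the penalization of out-of-distribution actions, and improving convergence speed.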

Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size

In recent years, large pre-trained Transformer-based language models have led
to dramatic improvements in many natural language understanding tasks. To train
these models with increasing sizes, many neural network practitioners attempt
to increase the batch sizes in order to leverage multiple GPUs to improve
training speed. However, increasing the batch size often makes the optimization
more difficult, leading to slow convergence or poor generalization that can
require orders of magnitude more training time to achieve the same model
quality. In this paper, we explore the steepness of the loss landscape of
large-batch optimization for adapting pre-trained Transformer-based language
models to domain-specific tasks and find that it tends to be highly complex and
irregular, posing challenges to generalization on downstream tasks.
To tackle this challenge, we propose ScaLA, a novel and efficient method to
accelerate the adaptation of pre-trained Transformer networks. Different
from prior methods, we take a sequential game-theoretic approach by adding
lightweight adversarial noise into large-batch optimization, which
significantly improves adaptation speed while preserving model generalization.
Experimental results show that ScaLA attains 2.7--9.8$\times$ adaptation speedups
over the baseline for GLUE on BERT-base and RoBERTa-large, while achieving
comparable and sometimes higher accuracy than the state-of-the-art large-batch
optimization methods. Finally, we also address the theoretical aspect of
large-batch optimization with adversarial noise and provide a theoretical
convergence rate analysis for ScaLA using techniques for analyzing non-convex
saddle-point problems.
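
The general idea of training against adversarial noise can be sketched on a toy problem. The code below is not ScaLA itself: it is a generic min-max step on a hypothetical logistic-regression model, where each input in the batch is first perturbed by a one-step, sign-based ascent on the loss and the parameters are then updated by gradient descent on the perturbed batch. All function names, the model, and the values of `eps` and `lr` are assumptions made for illustration.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def predict(w, x):
    """Sigmoid of the linear score w.x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, y):
    """Binary logistic loss for a label y in {0, 1}."""
    p = predict(w, x)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def perturb(w, x, y, eps):
    """One-step ascent on the input, bounded by eps per coordinate.

    For logistic loss on a linear score, d loss / d x = (p - y) * w.
    """
    p = predict(w, x)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]


def adversarial_sgd_step(w, batch, lr, eps):
    """Descend on w using gradients evaluated at the perturbed inputs."""
    grad = [0.0] * len(w)
    for x, y in batch:
        x_adv = perturb(w, x, y, eps)
        p_adv = predict(w, x_adv)
        # d loss / d w at the perturbed input is (p_adv - y) * x_adv.
        for j, xj in enumerate(x_adv):
            grad[j] += (p_adv - y) * xj / len(batch)
    return [wi - lr * gi for wi, gi in zip(w, grad)]


# Hypothetical single-example batch.
w0 = [0.5, -0.25]
batch = [([1.0, 2.0], 1)]
w1 = adversarial_sgd_step(w0, batch, lr=0.1, eps=0.1)
```

The inner perturbation costs one extra forward evaluation per example here; how the noise is kept lightweight for Transformer models at large batch sizes is specific to the method and is not captured by this sketch.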

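We propose ScaLA, which adds lightweight adversarial noise to large-batch optimization to accelerate the adaptation of pre-trained Transformer networks; it preserves model generalization while achieving accuracy comparable to, and sometimes higher than, state-of-the-art large-batch optimization methods.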

ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise

Optimizing multiple competing black-box objectives is a challenging problem
in many fields, including science, engineering, and machine learning.
Multi-objective Bayesian optimization (MOBO) is a sample-efficient approach for
identifying the optimal trade-offs between the objectives. However, many
existing methods perform poorly when the observations are corrupted by noise.
We propose a novel acquisition function, NEHVI, that overcomes this important
practical limitation by applying a Bayesian treatment to the popular expected
hypervolume improvement (EHVI) criterion and integrating over this uncertainty
in the Pareto frontier. We argue that, even in the noiseless setting,
generating multiple candidates in parallel is an incarnation of EHVI with
uncertainty in the Pareto frontier and therefore can be addressed using the
same underlying technique. Through this lens, we derive a natural parallel
variant, $q$NEHVI, that reduces the computational complexity of parallel EHVI from
exponential to polynomial with respect to the batch size. $q$NEHVI is one-step
Bayes-optimal for hypervolume maximization in both noisy and noiseless
environments, and we show that it can be optimized effectively with
gradient-based methods via sample average approximation. Empirically, we
demonstrate not only that $q$NEHVI is substantially more robust to observation
noise than existing MOBO approaches, but also that it achieves state-of-the-art
optimization performance and competitive wall-times in large-batch
environments.
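
To make the hypervolume improvement criterion concrete, here is a minimal two-objective sketch, assuming both objectives are maximized. It computes the hypervolume dominated by a set of points relative to a reference point, the improvement from adding one candidate, and a plain Monte Carlo average of that improvement over sampled realizations of the observed points. The samples stand in for posterior draws that a surrogate model would supply; the function names and the sampling interface are hypothetical, and this is not the authors' implementation of NEHVI or $q$NEHVI.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (maximization)."""
    best_f2 = ref[1]
    volume = 0.0
    # Sweep from the largest first objective to the smallest.
    for f1, f2 in sorted(points, reverse=True):
        if f1 > ref[0] and f2 > best_f2:
            volume += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return volume


def hv_improvement(candidate, front, ref):
    """Hypervolume gained by adding `candidate` to `front`."""
    with_candidate = list(front) + [candidate]
    return hypervolume_2d(with_candidate, ref) - hypervolume_2d(front, ref)


def mc_noisy_hvi(candidate_samples, front_samples, ref):
    """Average improvement over paired samples of the candidate and the front.

    Each pair is assumed to be one joint draw of the (noisy) objective
    values; averaging integrates over uncertainty in the Pareto frontier.
    """
    gains = [
        hv_improvement(c, f, ref)
        for c, f in zip(candidate_samples, front_samples)
    ]
    return sum(gains) / len(gains)


# Hypothetical observed points and reference point.
front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
ref = (0.0, 0.0)
gain = hv_improvement((2.5, 2.5), front, ref)
```

In this example the three observed points dominate an area of 6.0, and adding the candidate (2.5, 2.5) raises it to 7.25, so the improvement is 1.25.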

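This work proposes a new acquisition function, NEHVI, which integrates over the uncertainty in the Pareto frontier under the expected hypervolume improvement criterion, overcoming the limitation posed by observation noise and performing especially well in large-batch optimization settings.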