Recently, many efforts have attempted to learn useful policies for
continuous control in visual reinforcement learning (RL). In this scenario, it
is important to learn a generalizable policy, as the testing environment may
differ from the training environment, e.g., distractors may be present during
deployment. Many practical algorithms have been proposed to handle this problem.
However, to the best of our knowledge, none of them provides a theoretical
understanding of what affects the generalization gap and why the proposed
methods work. In this paper, we address this issue by theoretically identifying
the key factors that contribute to the generalization gap when the testing
environment has distractors. Our theory indicates that minimizing the
representation distance between training and testing environments, which aligns
with human intuition, is the most critical factor in reducing the
generalization gap. Our theoretical results are supported by empirical
evidence on the DMControl Generalization Benchmark (DMC-GB).
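
As a rough illustration of what "representation distance between training and testing environments" could mean, the sketch below compares encoder outputs for paired clean and distracted observations. The abstract does not give the paper's formal definition, so the metric (mean Euclidean distance), the toy observations, and the two encoders are all assumptions made for illustration only.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def representation_distance(encode: Callable[[Vector], Vector],
                            train_obs: Sequence[Vector],
                            test_obs: Sequence[Vector]) -> float:
    """Mean Euclidean distance between paired training and testing representations."""
    total = 0.0
    for clean, distracted in zip(train_obs, test_obs):
        z_train, z_test = encode(clean), encode(distracted)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(z_train, z_test)))
    return total / len(train_obs)


# Toy observations (hypothetical): the first two entries are task-relevant,
# the last entry stands in for the background.
train_obs = [[1.0, 2.0, 0.0], [0.5, -1.0, 0.0]]
# Same underlying states, but with a distractor written into the last entry.
test_obs = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]


def sensitive_encoder(obs: Vector) -> Vector:
    """Hypothetical encoder that leaks the background into the representation."""
    return [obs[0] + obs[2], obs[1]]


def invariant_encoder(obs: Vector) -> Vector:
    """Hypothetical encoder that ignores the background entry."""
    return [obs[0], obs[1]]


gap_sensitive = representation_distance(sensitive_encoder, train_obs, test_obs)
gap_invariant = representation_distance(invariant_encoder, train_obs, test_obs)
```

Under these assumptions, the encoder that ignores the distractor entry yields a smaller distance than the one that leaks it, which is the quantity the abstract says should be minimized.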

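By theoretically identifying the key factors that contribute to the generalization gap when the testing environment contains distractors, this paper addresses the issue and shows that minimizing the representation distance between training and testing environments is the most critical factor, which aligns with human intuition. The theoretical results are supported by empirical evidence on the DMControl Generalization Benchmark (DMC-GB).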

A Theoretical and Empirical Study of the Factors Affecting the Generalization Gap in Visual Reinforcement Learning

Understanding What Affects Generalization Gap in Visual Reinforcement Learning: Theory and Empirical Evidence

We present Placeto, a reinforcement learning (RL) approach to efficiently
find device placements for distributed neural network training. Unlike prior
approaches that only find a device placement for a specific computation graph,
Placeto can learn generalizable device placement policies that can be applied
to any graph. We propose two key ideas in our approach: (1) we represent the
policy as performing iterative placement improvements, rather than outputting a
placement in one shot; (2) we use graph embeddings to capture relevant
information about the structure of the computation graph, without relying on
node labels for indexing. These ideas allow Placeto to train efficiently and
generalize to unseen graphs. Our experiments show that Placeto requires up to
6.1x fewer training steps to find placements that are on par with or better
than the best placements found by prior approaches. Moreover, Placeto is able
to learn a generalizable placement policy for any given family of graphs, which
can then be used without any retraining to predict optimized placements for
unseen graphs from the same family. This eliminates the large overhead incurred
by prior RL approaches whose lack of generalizability necessitates re-training
from scratch every time a new graph is to be placed.
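
The first key idea, iterative placement improvement, can be sketched as follows. This is a minimal illustration and not Placeto's implementation: the cost model `toy_cost`, the helper names, and the example graph are hypothetical, and a greedy rule stands in for the learned RL policy and graph embeddings described above.

```python
from typing import Callable, Dict, List, Optional

Graph = Dict[str, List[str]]   # node -> list of downstream nodes
Placement = Dict[str, int]     # node -> device index


def toy_cost(graph: Graph, work: Dict[str, float], placement: Placement,
             num_devices: int, comm_penalty: float = 1.0) -> float:
    """Hypothetical cost: heaviest device load plus a penalty per cross-device edge."""
    load = [0.0] * num_devices
    for node, device in placement.items():
        load[device] += work[node]
    cut_edges = sum(1 for src, dsts in graph.items() for dst in dsts
                    if placement[src] != placement[dst])
    return max(load) + comm_penalty * cut_edges


def greedy_policy(graph: Graph, work: Dict[str, float], placement: Placement,
                  node: str, num_devices: int) -> int:
    """Stand-in for a learned policy: choose the device with the lowest toy cost."""
    def cost_if(device: int) -> float:
        trial = dict(placement)
        trial[node] = device
        return toy_cost(graph, work, trial, num_devices)
    return min(range(num_devices), key=cost_if)


def improve_placement(graph: Graph, work: Dict[str, float], num_devices: int,
                      policy: Callable[..., int],
                      init: Optional[Placement] = None) -> Placement:
    """Visit each node once and let `policy` re-place it given the current placement."""
    placement = dict(init) if init else {node: 0 for node in graph}
    for node in graph:  # one improvement step per node, not a one-shot output
        placement[node] = policy(graph, work, placement, node, num_devices)
    return placement


# Hypothetical diamond-shaped computation graph with per-node work estimates.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
work = {"a": 1.0, "b": 4.0, "c": 4.0, "d": 1.0}
initial = {node: 0 for node in graph}

improved = improve_placement(graph, work, 2, greedy_policy, init=initial)
initial_cost = toy_cost(graph, work, initial, 2)
improved_cost = toy_cost(graph, work, improved, 2)
```

Because the policy only decides where to move one node given the current placement, the same decision rule applies to graphs of any size, which is the property the abstract credits for generalization to unseen graphs.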

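This paper presents Placeto, a reinforcement learning approach for efficiently finding device placements for distributed neural network training. It can learn generalizable device placement policies that can be applied to any computation graph. Experiments show that Placeto finds placements on par with or better than the best placements found by existing methods, and that it can predict optimized placements for graphs in the same family without retraining, eliminating the large overhead incurred by other reinforcement learning approaches.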