Reinforcement learning (RL) agents are commonly trained and evaluated in the
same environment. In contrast, humans often train in a specialized environment
before being evaluated, such as studying a book before taking an exam. The
potential of such specialized training environments is still vastly
underexplored, despite their capacity to dramatically speed up training.
The framework of synthetic environments takes a first step in this direction
by meta-learning neural network-based Markov decision processes (MDPs). The
initial approach was limited to toy problems and produced environments that did
not transfer to unseen RL algorithms. We extend this approach in three ways.
Firstly, we modify the meta-learning algorithm to discover environments that
are invariant to hyperparameter configurations and learning algorithms.
Secondly, by leveraging hardware parallelism and introducing a curriculum on an
agent's evaluation episode horizon, we can achieve competitive results on
several challenging continuous control problems. Thirdly, we surprisingly find
that contextual bandits enable training RL agents that transfer well to their
evaluation environment, even if it is a complex MDP. Hence, we set up our
experiments to train synthetic contextual bandits, which perform on par with
synthetic MDPs, yield additional insights into the evaluation environment, and
can speed up downstream applications.
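
To make the contextual-bandit idea concrete, the following is a minimal sketch of a one-step synthetic environment: a context is sampled, the agent picks one action, receives a reward, and the episode ends. The class name, the uniform context distribution, and the linear reward model are all illustrative assumptions, not the parameterization used in the work described above, where the environment is a meta-learned neural network.

```python
import random


class SyntheticContextualBandit:
    """One-step environment: every episode ends after a single action."""

    def __init__(self, reward_weights, context_dim, seed=0):
        # reward_weights[a] is a hypothetical linear reward model for action a.
        # In a meta-learned setting these would be the trainable parameters.
        self.reward_weights = reward_weights
        self.context_dim = context_dim
        self.rng = random.Random(seed)
        self.context = None

    def reset(self):
        # Assumption: contexts are drawn uniformly from [-1, 1]^context_dim.
        self.context = [
            self.rng.uniform(-1.0, 1.0) for _ in range(self.context_dim)
        ]
        return self.context

    def step(self, action):
        weights = self.reward_weights[action]
        reward = sum(w * c for w, c in zip(weights, self.context))
        done = True  # horizon is always one, so there is no next state
        return None, reward, done


env = SyntheticContextualBandit([[1.0, 0.0], [0.0, 1.0]], context_dim=2)
context = env.reset()
_, reward, done = env.step(0)
```

Because there are no state transitions, the agent never needs to reason about long-term consequences inside the synthetic environment; everything it learns comes from the reward assigned to each context-action pair.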

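By meta-learning neural network-based Markov decision processes, we find that specialized training environments have the potential to speed up the training of reinforcement learning agents, and that contextual bandits transfer well to the evaluation environment, which in turn accelerates downstream applications.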

Discovering Minimal Reinforcement Learning Environments

Reinforcement learning (RL) has gained popularity in the realm of recommender
systems due to its ability to optimize long-term rewards and guide users in
discovering relevant content. However, the successful implementation of RL in
recommender systems is challenging because of several factors, including the
limited availability of online data for training on-policy methods. This
scarcity requires expensive human interaction for online model training.
Furthermore, the development of effective evaluation frameworks that accurately
reflect the quality of models remains a fundamental challenge in recommender
systems. To address these challenges, we propose a comprehensive framework for
synthetic environments that simulate human behavior by harnessing the
capabilities of large language models (LLMs). We complement our framework with
in-depth ablation studies and demonstrate its effectiveness with experiments on
movie and book recommendations. By utilizing LLMs as synthetic users, this work
introduces a modular and novel framework for training RL-based recommender
systems. The software, including the RL environment, is publicly available.
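
As a rough illustration of the idea of a synthetic user inside an RL environment, the sketch below wraps a rating function behind a `reset`/`step` interface. Everything here is hypothetical: `SyntheticUserEnv`, `keyword_rater`, and the `genre:title` item format are invented for this example, and `keyword_rater` is a deterministic stand-in for where an LLM call would go. It does not reproduce the published software.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: (user profile, item, interaction history) -> rating.
RateFn = Callable[[str, str, List[str]], int]


class SyntheticUserEnv:
    """Recommendation episode in which a rating function plays the user."""

    def __init__(self, user_profile: str, catalog: List[str],
                 rate_fn: RateFn, max_steps: int = 5):
        self.user_profile = user_profile
        self.catalog = catalog
        self.rate_fn = rate_fn
        self.max_steps = max_steps
        self.history: List[str] = []

    def reset(self) -> List[str]:
        self.history = []
        return list(self.history)

    def step(self, item_index: int) -> Tuple[List[str], float, bool]:
        item = self.catalog[item_index]
        # The rating is computed before the item is added to the history.
        rating = self.rate_fn(self.user_profile, item, self.history)
        self.history.append(item)
        done = len(self.history) >= self.max_steps
        return list(self.history), float(rating), done


def keyword_rater(profile: str, item: str, history: List[str]) -> int:
    """Stand-in for an LLM: repeats rate 1, genre matches 5, all else 2."""
    if item in history:
        return 1
    genre = item.split(":")[0]
    return 5 if genre in profile else 2


env = SyntheticUserEnv("likes scifi and drama",
                       ["scifi:Dune", "horror:It"], keyword_rater, max_steps=2)
env.reset()
observation, reward, done = env.step(0)
```

Keeping the rater behind a plain callable is one way to get the modularity the abstract mentions: the same environment loop can be driven by a cheap rule during testing and by a prompted model during training.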

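By using large language models (LLMs) to simulate human behavior, this work proposes a comprehensive framework for training RL-based recommender systems, provides in-depth ablation studies, and demonstrates its effectiveness through experiments on movie and book recommendations.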

An LLM-based Recommender System Environment

We introduce Synthetic Environments (SEs) and Reward Networks (RNs),
represented by neural networks, as proxy environment models for training
Reinforcement Learning (RL) agents. We show that an agent, after being trained
exclusively on the SE, is able to solve the corresponding real environment.
While an SE acts as a full proxy to a real environment by learning about its
state dynamics and rewards, an RN is a partial proxy that learns to augment or
replace rewards. We use bi-level optimization to evolve SEs and RNs: the inner
loop trains the RL agent, and the outer loop trains the parameters of the SE /
RN via an evolution strategy. We evaluate our proposed new concept on a broad
range of RL algorithms and classic control environments. In a one-to-one
comparison, learning an SE proxy requires more interactions with the real
environment than training agents only on the real environment. However, once
such an SE has been learned, we do not need any interactions with the real
environment to train new agents. Moreover, the learned SE proxies allow us to
train agents with fewer interactions while maintaining the original task
performance. Our empirical results suggest that SEs achieve this result by
learning informed representations that bias the agents towards relevant states.
Moreover, we find that these proxies are robust against hyperparameter
variation and can also transfer to unseen agents.
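
The bi-level structure described above (an inner loop that trains an agent on the proxy, an outer loop that evolves the proxy's parameters based on real-environment performance) can be sketched on a toy problem. The example below makes strong simplifying assumptions: the "real environment" is a one-step task with reward -(a - TARGET)^2, the synthetic environment has a single scalar parameter `theta`, the "agent" is a scalar action trained by gradient ascent rather than an RL algorithm, and the outer loop uses a basic antithetic evolution strategy. None of these choices are taken from the paper.

```python
import random

TARGET = 2.0  # hidden optimum of the toy "real" environment (assumption)


def real_reward(action):
    """Reward in the real environment; the outer loop only queries this."""
    return -(action - TARGET) ** 2


def train_agent_on_se(theta, steps=20, lr=0.1):
    """Inner loop: gradient ascent on the synthetic reward -(a - theta)^2."""
    action = 0.0
    for _ in range(steps):
        action += lr * 2.0 * (theta - action)
    return action


def evolve_se(generations=200, pairs=8, sigma=0.1, alpha=0.1, seed=0):
    """Outer loop: antithetic evolution strategy over the SE parameter."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(generations):
        grad = 0.0
        for _ in range(pairs):
            eps = rng.gauss(0.0, 1.0)
            # Train one agent per perturbed SE, then score it on the real task.
            f_plus = real_reward(train_agent_on_se(theta + sigma * eps))
            f_minus = real_reward(train_agent_on_se(theta - sigma * eps))
            grad += (f_plus - f_minus) * eps
        theta += alpha * grad / (2.0 * pairs * sigma)
    return theta


theta = evolve_se()
agent_action = train_agent_on_se(theta)  # uses no real-environment interaction
```

Note that the final call to `train_agent_on_se` never touches `real_reward`, which mirrors the claim that new agents can be trained without further real-environment interactions once the proxy has been learned. The cost is visible in `evolve_se`: every fitness evaluation in the outer loop requires a full inner training run followed by a real-environment evaluation.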

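This paper introduces Synthetic Environments and Reward Networks, proxy environment models for training reinforcement learning agents, both of which can be evolved through bi-level optimization. The results indicate that Synthetic Environments help agents by learning representations biased towards relevant states, which reduces the number of real-environment interactions needed to train new agents; the proxies are also robust to hyperparameter variation and generalize well.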

Learning Synthetic Environments and Reward Networks for Reinforcement Learning

What is a good visual representation for autonomous agents? We address this
question in the context of semantic visual navigation, which is the problem of
a robot finding its way through a complex environment to a target object, e.g.,
"go to the refrigerator". Instead of acquiring a metric semantic map of an
environment and using planning for navigation, our approach learns navigation
policies on top of representations that capture spatial layout and semantic
contextual cues. We propose using high-level semantic and contextual
features, including segmentation and detection masks obtained from off-the-shelf
state-of-the-art vision models, as observations, and a deep network to learn the
navigation policy. This choice allows using additional data, from orthogonal
sources, to better train different parts of the model: the representation
extraction is trained on large standard vision datasets, while the navigation
component leverages large synthetic environments for training. This combination
of real and synthetic data is possible because equivalent feature representations are
available in both (e.g., segmentation and detection masks), which alleviates
the need for domain adaptation. Both the representation and the navigation
policy can be readily applied to real non-synthetic environments as
demonstrated on the Active Vision Dataset [1]. Our approach successfully
reaches the target in 54% of the cases in unexplored environments, compared to
46% for a non-learning-based approach and 28% for the learning-based baseline.
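
To illustrate why mask-based observations can be shared between real and synthetic sources, here is a small sketch that turns detection boxes into per-class binary masks, the kind of input a navigation policy could consume. The function names, the end-exclusive `(row0, col0, row1, col1)` box format, and the class list are assumptions made for this example; the abstract does not specify how the observations are encoded.

```python
def detection_mask(boxes, height, width):
    """Rasterize end-exclusive (row0, col0, row1, col1) boxes into a 0/1 grid."""
    mask = [[0] * width for _ in range(height)]
    for r0, c0, r1, c1 in boxes:
        # Clip each box to the image bounds before filling it in.
        for r in range(max(r0, 0), min(r1, height)):
            for c in range(max(c0, 0), min(c1, width)):
                mask[r][c] = 1
    return mask


def build_observation(detections, class_names, height, width):
    """Stack one mask channel per class, in the fixed order of class_names."""
    return [
        detection_mask(detections.get(name, []), height, width)
        for name in class_names
    ]


# The detections could come from a detector on real images or from
# ground-truth annotations in a simulator; the policy input looks the same.
observation = build_observation(
    {"refrigerator": [(0, 0, 2, 2)]}, ["refrigerator", "chair"], 3, 4
)
```

Since the policy only ever sees these masks, it does not matter whether they were produced by a vision model on camera images or read directly from a simulator, which is the property the abstract relies on to avoid domain adaptation.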

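This study explores semantic visual navigation in complex environments. A deep neural network learns navigation decisions from off-the-shelf high-level semantic and contextual features, and combining the feature representations of real and synthetic data improves learning and yields higher navigation performance.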