We consider the problem of sampling from a discrete and structured
distribution as a sequential decision problem, where the objective is to find a
stochastic policy such that objects are sampled at the end of this sequential
process proportionally to some predefined reward. While we could use maximum
entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some
distributions, it has been shown that in general, the distribution over states
induced by the optimal policy may be biased in cases where there are multiple
ways to generate the same object. To address this issue, Generative Flow
Networks (GFlowNets) learn a stochastic policy that samples objects
proportionally to their reward by approximately enforcing a conservation of
flows across the whole Markov Decision Process (MDP). In this paper, we extend
recent methods correcting the reward in order to guarantee that the marginal
distribution induced by the optimal MaxEnt RL policy is proportional to the
original reward, regardless of the structure of the underlying MDP. We also
prove that some flow-matching objectives found in the GFlowNet literature are
in fact equivalent to well-established MaxEnt RL algorithms with a corrected
reward. Finally, we study empirically the performance of multiple MaxEnt RL and
GFlowNet algorithms on multiple problems involving sampling from discrete
distributions.
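A minimal sketch of why multiple generation paths bias the sampler, on a hypothetical toy DAG that is not taken from the paper. It assumes a deterministic MDP in which the reward log R(x) is received only at termination, so that the optimal MaxEnt RL policy samples each complete trajectory with probability proportional to R(x). The correction shown, adding the log of a uniform backward probability (one over the number of parents) on every transition, is one assumed form of corrected reward; all names are illustrative.

```python
import math
from collections import defaultdict

# Hypothetical toy DAG: object "x" can be built along two paths, "y" along one.
CHILDREN = {"s0": ["a", "b"], "a": ["x", "y"], "b": ["x"]}
REWARD = {"x": 1.0, "y": 1.0}  # terminal objects and their (equal) rewards


def trajectories(state="s0", prefix=("s0",)):
    """Enumerate every complete trajectory from the initial state."""
    if state in REWARD:
        yield prefix
        return
    for child in CHILDREN[state]:
        yield from trajectories(child, prefix + (child,))


def parent_counts():
    """Number of parents of each state, used for a uniform backward policy."""
    counts = defaultdict(int)
    for children in CHILDREN.values():
        for child in children:
            counts[child] += 1
    return counts


def terminal_marginal(corrected):
    """Marginal over terminal objects when trajectories are sampled
    proportionally to exp(return)."""
    n_parents = parent_counts()
    weights = defaultdict(float)
    for traj in trajectories():
        log_w = math.log(REWARD[traj[-1]])
        if corrected:
            # Assumed correction: add log(1 / #parents) for each transition.
            log_w += sum(-math.log(n_parents[s]) for s in traj[1:])
        weights[traj[-1]] += math.exp(log_w)
    total = sum(weights.values())
    return {obj: w / total for obj, w in weights.items()}
```

Without the correction, "x" receives two thirds of the probability mass even though both objects have the same reward; with it, each object receives one half.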

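This paper extends recent reward-correction methods so that the marginal distribution induced by the optimal maximum entropy reinforcement learning policy is proportional to the original reward, whereas GFlowNets pursue the same goal by approximately enforcing a conservation of flows across the whole Markov decision process.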

Discrete Probabilistic Inference as Control in Multi-path Environments

Large Neighborhood Search (LNS) is a popular heuristic for solving
combinatorial optimization problems. LNS iteratively explores the neighborhoods
in solution spaces using destroy and repair operators. Determining the best
operators for LNS to solve a problem at hand is a labor-intensive process.
Hence, Adaptive Large Neighborhood Search (ALNS) has been proposed to
adaptively select operators during the search process based on operator
performances of the previous search iterations. Such an operator selection
procedure is a heuristic, based on domain knowledge, which is ineffective with
complex, large solution spaces. In this paper, we address the problem of
selecting operators for each search iteration of ALNS as a sequential decision
problem and propose a Deep Reinforcement Learning based method called Deep
Reinforced Adaptive Large Neighborhood Search. As such, the proposed method
aims to learn based on the state of the search which operation to select to
obtain a high long-term reward, i.e., a good solution to the underlying
optimization problem. The proposed method is evaluated on a time-dependent
orienteering problem with stochastic weights and time windows. Results show
that our approach effectively learns a strategy that adaptively selects
operators for large neighborhood search, obtaining competitive results compared
to a state-of-the-art machine learning approach while being trained with far
fewer observations on small-sized problem instances.
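The destroy-and-repair loop described above can be sketched as follows. This is a hedged illustration on a toy permutation problem, not the paper's method: the operator-selection step uses a roulette wheel over adaptive weights, which stands in for the classic ALNS heuristic, and it marks the place where a learned policy would be substituted. The objective, the operators and all parameter values are assumptions.

```python
import random


def cost(sol):
    # Toy objective (assumption): displacement of each item from its sorted slot.
    return sum(abs(v - i) for i, v in enumerate(sol))


def destroy_random(sol, k, rng):
    removed = rng.sample(sol, k)
    return [v for v in sol if v not in removed], removed


def destroy_worst(sol, k, rng):
    # Remove the k items that are farthest from their sorted slot.
    ranked = sorted(range(len(sol)), key=lambda i: abs(sol[i] - i), reverse=True)
    removed = [sol[i] for i in ranked[:k]]
    return [v for v in sol if v not in removed], removed


def repair_greedy(partial, removed):
    # Reinsert each removed item at the position that minimizes the cost.
    sol = list(partial)
    for v in removed:
        best = min(range(len(sol) + 1),
                   key=lambda p: cost(sol[:p] + [v] + sol[p:]))
        sol.insert(best, v)
    return sol


def select_operator(weights, rng):
    # Heuristic roulette-wheel selection; a learned policy would replace this.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def alns(initial, iterations=200, k=3, decay=0.8, seed=0):
    rng = random.Random(seed)
    destroyers = [destroy_random, destroy_worst]
    weights = [1.0] * len(destroyers)
    current = list(initial)
    for _ in range(iterations):
        idx = select_operator(weights, rng)
        partial, removed = destroyers[idx](current, k, rng)
        candidate = repair_greedy(partial, removed)
        improved = cost(candidate) < cost(current)
        # Reward operators that produced an improvement in this iteration.
        score = 1.0 if improved else 0.1
        weights[idx] = decay * weights[idx] + (1 - decay) * score
        if improved:
            current = candidate
    return current, weights
```

Only improving candidates are accepted here, so the returned solution is never worse than the initial one; practical ALNS variants usually add an acceptance criterion such as simulated annealing.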

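This paper proposes a deep reinforcement learning method for selecting operators in Adaptive Large Neighborhood Search (ALNS), with the aim of improving solution quality. Empirical results show that, compared with a machine learning based approach, the method obtains competitive results while being trained with far fewer observations on small-sized problem instances.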

Operator Selection in Adaptive Large Neighborhood Search using Deep Reinforcement Learning

Recommender systems (RSs) have become an inseparable part of our everyday
lives. They help us find our favorite items to purchase, our friends on social
networks, and our favorite movies to watch. Traditionally, the recommendation
problem was considered to be a classification or prediction problem, but it is
now widely agreed that formulating it as a sequential decision problem can
better reflect the user-system interaction. Therefore, it can be formulated as
a Markov decision process (MDP) and be solved by reinforcement learning (RL)
algorithms. Unlike traditional recommendation methods, including collaborative
filtering and content-based filtering, RL is able to handle the sequential,
dynamic user-system interaction and to take into account the long-term user
engagement. Although the idea of using RL for recommendation is not new and has
been around for about two decades, it was not very practical, mainly because of
scalability problems of traditional RL algorithms. However, a new trend has
emerged in the field since the introduction of deep reinforcement learning
(DRL), which made it possible to apply RL to the recommendation problem with
large state and action spaces. In this paper, a survey on reinforcement
learning based recommender systems (RLRSs) is presented. Our aim is to present
an outlook on the field and to provide the reader with a fairly complete
understanding of its key concepts. We first recognize and illustrate that
RLRSs can be generally classified into RL- and DRL-based methods. Then, we
propose an RLRS framework with four components, i.e., state representation,
policy optimization, reward formulation, and environment building, and survey
RLRS algorithms accordingly. We highlight emerging topics and depict important
trends using various graphs and tables. Finally, we discuss important aspects
and challenges that can be addressed in the future.
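To make the MDP formulation concrete, here is a minimal sketch that maps the four framework components onto tabular Q-learning with a simulated user. The user model, the click reward and every constant are hypothetical and chosen only for illustration; none of this comes from a surveyed algorithm.

```python
import random

N_ITEMS = 3
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2


def simulate_user(state, action):
    """Environment building: a hypothetical user who clicks only on the
    item that follows the last one consumed."""
    clicked = action == (state + 1) % N_ITEMS
    reward = 1.0 if clicked else 0.0           # reward formulation: click = 1
    next_state = action if clicked else state  # state: last consumed item
    return reward, next_state


def train(steps=2000, seed=0):
    """Policy optimization: epsilon-greedy tabular Q-learning."""
    rng = random.Random(seed)
    q = [[0.0] * N_ITEMS for _ in range(N_ITEMS)]
    state = 0
    for _ in range(steps):
        if rng.random() < EPSILON:
            action = rng.randrange(N_ITEMS)
        else:
            action = max(range(N_ITEMS), key=lambda a: q[state][a])
        reward, next_state = simulate_user(state, action)
        target = reward + GAMMA * max(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state
    return q


def greedy_policy(q):
    """Recommended item for each state under the learned Q-values."""
    return [max(range(N_ITEMS), key=lambda a: q[s][a]) for s in range(N_ITEMS)]
```

After training, `greedy_policy(train())` should recommend the item the simulated user clicks on in each state. The table-based value function is what stops scaling to large state and action spaces, which is the gap deep RL methods address.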

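This paper surveys reinforcement learning based recommender systems (RLRSs). It proposes an RLRS framework with four components (state representation, policy optimization, reward formulation, and environment building), surveys RLRS algorithms accordingly, highlights emerging topics, and depicts important trends using graphs and tables.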

Reinforcement learning based recommender systems: A survey

Hyperparameter tuning is an omnipresent problem in machine learning as it is
an integral aspect of obtaining state-of-the-art performance for any model.
Most often, hyperparameters are optimized just by training a model on a grid of
possible hyperparameter values and taking the one that performs best on a
validation sample (grid search). More recently, methods have been introduced
that build a so-called surrogate model that predicts the validation loss for a
specific hyperparameter setting, model and dataset and then sequentially select
the next hyperparameter to test, based on an acquisition function, a heuristic
function of the expected value and the uncertainty of the surrogate model
(sequential model-based Bayesian optimization, SMBO).
In this paper we model the hyperparameter optimization problem as a
sequential decision problem, namely which hyperparameter to test next, and
address it
with reinforcement learning. This way our model does not have to rely on a
heuristic acquisition function like SMBO, but can learn which hyperparameters
to test next based on the subsequent reduction in validation loss they will
eventually lead to, either because they yield good models themselves or because
they allow the hyperparameter selection policy to build a better surrogate
model that is able to choose better hyperparameters later on. Experiments on a
large battery of 50 data sets demonstrate that our method outperforms the
state-of-the-art approaches for hyperparameter learning.
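The sequential view of hyperparameter search can be sketched as a loop in which a policy picks the next configuration from the history of tested ones. In this hedged illustration the policy is plain grid search, which ignores the history; a surrogate-based or learned policy would use it. The `validation_loss` function is a hypothetical stand-in for training a model and scoring it on a validation sample, and all names and values are assumptions.

```python
import itertools
import math


def validation_loss(lr, depth):
    # Hypothetical stand-in for "train a model, evaluate on validation data".
    return (math.log10(lr) + 2.0) ** 2 + 0.1 * (depth - 4) ** 2


def grid_policy(grid):
    """Policy that ignores the history and walks the grid in order."""
    pending = iter(grid)

    def choose(history):
        return next(pending)

    return choose


def sequential_search(choose, budget):
    """Test `budget` configurations one at a time and return the best
    (config, loss) pair observed."""
    history = []
    for _ in range(budget):
        config = choose(history)
        history.append((config, validation_loss(*config)))
    return min(history, key=lambda item: item[1])


GRID = list(itertools.product([1e-3, 1e-2, 1e-1], [2, 4, 8]))
```

Calling `sequential_search(grid_policy(GRID), budget=len(GRID))` evaluates every grid point. Swapping `grid_policy` for a policy that reads `history` changes only the `choose` function, which is the decision the paper proposes to learn.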

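This paper models hyperparameter optimization as a sequential decision problem, namely which hyperparameter to test next, and addresses it with reinforcement learning. Experiments on 50 data sets show that the method outperforms state-of-the-art approaches for hyperparameter learning.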