We consider the problem of sampling from a discrete and structured
distribution as a sequential decision problem, where the objective is to find a
stochastic policy such that objects are sampled at the end of this sequential
process proportionally to some predefined reward. While we could use maximum
entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some
distributions, it has been shown that in general, the distribution over states
induced by the optimal policy may be biased in cases where there are multiple
ways to generate the same object. To address this issue, Generative Flow
Networks (GFlowNets) learn a stochastic policy that samples objects
proportionally to their reward by approximately enforcing a conservation of
flows across the whole Markov Decision Process (MDP). In this paper, we extend
recent methods correcting the reward in order to guarantee that the marginal
distribution induced by the optimal MaxEnt RL policy is proportional to the
original reward, regardless of the structure of the underlying MDP. We also
prove that some flow-matching objectives found in the GFlowNet literature are
in fact equivalent to well-established MaxEnt RL algorithms with a corrected
reward. Finally, we empirically study the performance of several MaxEnt RL and
GFlowNet algorithms on multiple problems involving sampling from discrete
distributions.
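
The multi-path bias and its correction can be illustrated on a toy example. The sketch below is not taken from the paper: the three-level DAG, the rewards, and the helper names are all hypothetical. It also assumes that the reward correction takes the form of adding log P_B(s | s') to each transition, with a uniform backward policy P_B over the parents of s'. It solves the soft Bellman recursion exactly at temperature 1 and compares the resulting distribution over objects with and without the correction.

```python
import math

# Hypothetical toy DAG: object "x" can be reached through "a" or "b",
# while object "y" can only be reached through "b".
CHILDREN = {"s0": ["a", "b"], "a": ["x"], "b": ["x", "y"]}
REWARD = {"x": 1.0, "y": 3.0}  # target distribution: P(x) = 0.25, P(y) = 0.75


def parents(state):
    """All states with an edge into `state`."""
    return [s for s, kids in CHILDREN.items() if state in kids]


def step_reward(s, s_next, corrected):
    """log R(s_next) on terminating transitions, 0 otherwise.

    With `corrected=True`, log P_B(s | s_next) is added for a uniform
    backward policy (an assumption of this sketch).
    """
    r = math.log(REWARD[s_next]) if s_next in REWARD else 0.0
    if corrected:
        r -= math.log(len(parents(s_next)))
    return r


def soft_value(s, corrected):
    """Soft (log-sum-exp) optimal value at temperature 1."""
    if s in REWARD:
        return 0.0
    return math.log(sum(
        math.exp(step_reward(s, c, corrected) + soft_value(c, corrected))
        for c in CHILDREN[s]
    ))


def terminating_distribution(corrected):
    """Distribution over objects induced by the optimal MaxEnt policy."""
    probs = {x: 0.0 for x in REWARD}

    def walk(s, p):
        if s in REWARD:
            probs[s] += p
            return
        v = soft_value(s, corrected)
        for c in CHILDREN[s]:
            pi = math.exp(step_reward(s, c, corrected)
                          + soft_value(c, corrected) - v)
            walk(c, p * pi)

    walk("s0", 1.0)
    return probs
```

Under these assumptions, `terminating_distribution(False)` puts probability of about 0.4 on `x`, because `x` has two generating paths. `terminating_distribution(True)` recovers about 0.25, which is proportional to the original reward.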

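This paper extends recent reward-correction methods so that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP. It also shows that some flow-matching objectives from the GFlowNet literature, which approximately enforce a conservation of flows across the whole MDP, are equivalent to MaxEnt RL algorithms with a corrected reward.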

Discrete Probabilistic Inference as Control in Multi-path Environments

Maximum entropy (MaxEnt) RL maximizes a combination of the original task
reward and an entropy reward. It is believed that the regularization imposed by
entropy on both policy improvement and policy evaluation contributes
to good exploration, training convergence, and robustness of learned policies.
This paper takes a closer look at entropy as an intrinsic reward, by conducting
various ablation studies on soft actor-critic (SAC), a popular representative
of MaxEnt RL. Our findings reveal that in general, entropy rewards should be
applied with caution to policy evaluation. On one hand, the entropy reward,
like any other intrinsic reward, could obscure the main task reward if it is
not properly managed. We identify some failure cases of the entropy reward,
especially in episodic Markov decision processes (MDPs), where it could cause
the policy to be overly optimistic or pessimistic. On the other hand, our
large-scale empirical study shows that using entropy regularization in policy
improvement alone leads to comparable or even better performance and
robustness than using it in both policy improvement and policy evaluation.
Based on these observations, we recommend either normalizing the entropy reward
to a zero mean (SACZero), or simply removing it from policy evaluation
(SACLite) for better practical results.
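
The difference between the variants lies in the bootstrapped target used for policy evaluation. The sketch below is a minimal illustration, not the authors' implementation: the function name `td_target`, the default `alpha` and `gamma`, and the use of the batch mean to center the entropy reward are all assumptions. The abstract only says the reward is normalized to a zero mean. Handling of terminal transitions is omitted.

```python
import numpy as np


def td_target(reward, q_next, logp_next, alpha=0.2, gamma=0.99, variant="sac"):
    """Critic target for one batch of transitions.

    reward, q_next, logp_next: 1-D float arrays of equal length, where
    logp_next is log pi(a' | s') for the sampled next action.
    """
    entropy_reward = -alpha * logp_next
    if variant == "sac":
        # Standard soft target: the entropy reward enters policy evaluation.
        bonus = entropy_reward
    elif variant == "saclite":
        # Entropy reward removed from policy evaluation entirely.
        bonus = np.zeros_like(entropy_reward)
    elif variant == "saczero":
        # Entropy reward centered to zero mean (batch mean: an assumption).
        bonus = entropy_reward - entropy_reward.mean()
    else:
        raise ValueError(f"unknown variant: {variant}")
    return reward + gamma * (q_next + bonus)
```

In all three cases, the policy improvement step would still use the entropy term; only the critic target changes.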

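This paper studies the effect of entropy as an intrinsic reward through various ablation studies on soft actor-critic (SAC), a popular MaxEnt RL method. We find that the entropy reward should be applied with caution to policy evaluation, and that using entropy regularization in policy improvement alone yields comparable or even better performance and robustness. We therefore recommend either normalizing the entropy reward to zero mean (SACZero) or simply removing it from policy evaluation (SACLite) for better practical results.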

Do You Need the Entropy Reward (in Practice)?

Experimentally, it has been observed that humans and animals often do not make
decisions that maximize their expected utility, but rather choose
outcomes randomly, with probability proportional to expected utility.
Probability matching, as this strategy is called, is equivalent to maximum
entropy reinforcement learning (MaxEnt RL). However, MaxEnt RL does not
optimize expected utility. In this paper, we formally show that MaxEnt RL does
optimally solve certain classes of control problems with variability in the
reward function. In particular, we show (1) that MaxEnt RL can be used to solve
a certain class of POMDPs, and (2) that MaxEnt RL is equivalent to a two-player
game where an adversary chooses the reward function. These results suggest a
deeper connection between MaxEnt RL, robust control, and POMDPs, and provide
insight into the types of problems for which we might expect MaxEnt RL to
produce effective solutions. Specifically, our results suggest that domains
with uncertainty in the task goal may be especially well-suited for MaxEnt RL
methods.
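
A one-step example makes the gap between maximizing expected utility and the MaxEnt objective concrete. In the sketch below, the function names and the temperature `alpha` are hypothetical. The policy that maximizes expected utility plus `alpha` times entropy is a softmax over utilities, so outcomes are chosen with probability proportional to exponentiated utility. Whether this matches "proportional to expected utility" depends on the scale on which utility is measured, which is an assumption here.

```python
import numpy as np


def maxent_policy(utilities, alpha=1.0):
    """Softmax over utilities: the maximizer of E[u] + alpha * entropy."""
    z = (utilities - utilities.max()) / alpha  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()


def regularized_objective(policy, utilities, alpha=1.0):
    """Expected utility plus alpha times the Shannon entropy of the policy."""
    entropy = -np.sum(policy * np.log(np.clip(policy, 1e-300, None)))
    return float(policy @ utilities + alpha * entropy)
```

For any utilities, `regularized_objective(maxent_policy(u), u)` equals `log(sum(exp(u)))` at `alpha=1`, and no other distribution over outcomes scores higher. The greedy policy instead maximizes the expected-utility term alone.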

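This paper shows that MaxEnt RL optimally solves certain classes of control problems with variability in the reward function. It can also solve a certain class of POMDPs and is equivalent to a two-player game. These results offer insight into where MaxEnt RL is especially well-suited, namely domains with uncertainty in the task goal.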