We study the sparse entropy-regularized reinforcement learning (ERL) problem
in which the entropy term is a special form of the Tsallis entropy. The optimal
policy of this formulation is sparse, i.e.,~at each state, it assigns non-zero
probability to only a small number of actions. This addresses the main
drawback of the standard Shannon entropy-regularized RL (soft ERL) formulation,
in which the optimal policy is softmax and may therefore assign non-negligible
probability mass to non-optimal actions. This problem worsens as the
number of actions grows. In this paper, we follow the work of Nachum et
al. (2017) in the soft ERL setting, and propose a class of novel path
consistency learning (PCL) algorithms, called {\em sparse PCL}, for the sparse
ERL problem that can work with both on-policy and off-policy data. We first
derive a {\em sparse consistency} equation that specifies a relationship
between the optimal value function and policy of the sparse ERL along any
system trajectory. Crucially, a weak form of the converse also holds: we
quantify the sub-optimality of a policy that satisfies sparse consistency and
show that, as the number of actions increases, this sub-optimality compares
favorably with that of the soft ERL optimal policy. We then use this result to derive the
sparse PCL algorithms. We empirically compare sparse PCL with its soft
counterpart, and show its advantage, especially in problems with a large number
of actions.
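
To make the contrast between softmax and sparse policies concrete, the following is a minimal sketch, not the paper's algorithm. It assumes that the sparse optimal policy takes the sparsemax form, i.e. the Euclidean projection of the scaled action values onto the probability simplex, which is the usual closed form for this kind of Tsallis regularizer. The function names, the temperature `tau`, and the toy action values are all hypothetical.

```python
import math


def softmax(q, tau=1.0):
    """Softmax policy over action values q (soft ERL style)."""
    m = max(q)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / tau) for x in q]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(q, tau=1.0):
    """Euclidean projection of q / tau onto the probability simplex.

    Assumption: this is the form of the sparse policy; actions whose
    scaled value falls below the threshold get exactly zero probability.
    """
    z = [x / tau for x in q]
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    k_support, cumsum_support = 0, 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        # The support is the largest k for which this inequality holds.
        if 1.0 + k * zk > cumsum:
            k_support, cumsum_support = k, cumsum
    threshold = (cumsum_support - 1.0) / k_support
    return [max(x - threshold, 0.0) for x in z]


if __name__ == "__main__":
    # One good action and a growing number of poor ones (toy values).
    for n in (4, 16, 64, 256):
        q = [1.0] + [0.0] * (n - 1)
        print(n, "softmax mass on best action:", softmax(q)[0],
              "sparsemax mass on best action:", sparsemax(q)[0])
```

In this toy setting the softmax mass on the best action shrinks as poor actions are added, while the sparsemax policy gives them zero probability, which is the behaviour the abstract attributes to the sparse formulation.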

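This work studies the sparse entropy-regularized reinforcement learning problem and proposes a novel path consistency learning algorithm called "sparse PCL". It shows that sparse PCL has an advantage over its counterpart for the standard Shannon entropy-regularized RL (soft ERL) problem, especially when the number of actions is large.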

Path Consistency Learning in Tsallis Entropy Regularized MDPs

We study the problem of decision-theoretic online learning (DTOL). Motivated
by practical applications, we focus on DTOL when the number of actions is very
large. Previous algorithms for learning in this framework have a tunable
learning rate parameter. A barrier to using online learning in practical
applications is that it is not understood how to set this parameter optimally,
particularly when the number of actions is large.
In this paper, we offer a clean solution by proposing a novel and completely
parameter-free algorithm for DTOL. We introduce a new notion of regret, which
is more natural for applications with a large number of actions. We show that
our algorithm achieves good performance with respect to this new notion of
regret; in addition, its performance is close to the best bounds achieved by
previous algorithms with optimally tuned parameters, under previous notions of
regret.
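
To illustrate the tuning issue that motivates a parameter-free method, here is a minimal sketch of the classical Hedge (exponential weights) update, a standard earlier DTOL algorithm with a learning rate. It is not the parameter-free algorithm proposed in the paper. The function name, the learning rate `eta`, and the random losses are hypothetical choices for illustration; losses are assumed to lie in [0, 1].

```python
import math
import random


def hedge(loss_rounds, eta):
    """Run Hedge with learning rate eta on a sequence of loss vectors.

    Returns (learner's expected cumulative loss, regret to the best action).
    """
    n = len(loss_rounds[0])
    cum_loss = [0.0] * n
    learner_loss = 0.0
    for losses in loss_rounds:
        # Weights are exponential in the negative cumulative loss;
        # subtracting the minimum keeps the exponentials well scaled.
        m = min(cum_loss)
        weights = [math.exp(-eta * (c - m)) for c in cum_loss]
        total = sum(weights)
        probs = [w / total for w in weights]
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        cum_loss = [c + l for c, l in zip(cum_loss, losses)]
    return learner_loss, learner_loss - min(cum_loss)


if __name__ == "__main__":
    rng = random.Random(0)
    n_actions, horizon = 50, 500
    rounds = [[rng.random() for _ in range(n_actions)] for _ in range(horizon)]
    # The regret depends on eta; the textbook tuning needs both the number
    # of actions and the horizon to be known in advance.
    tuned = math.sqrt(8 * math.log(n_actions) / horizon)
    for eta in (0.01, tuned, 5.0):
        print("eta =", eta, "regret =", hedge(rounds, eta)[1])
```

The textbook choice `eta = sqrt(8 ln N / T)` requires knowing both the number of actions N and the horizon T, which gives a sense of why setting this parameter can be awkward in practice.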

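This paper focuses on decision-theoretic online learning (DTOL) for decision problems with a large number of actions. We propose a novel, completely parameter-free algorithm for DTOL, which removes a practical obstacle to online learning: not knowing how to set the learning rate parameter optimally. In addition, we introduce a new notion of regret. The algorithm performs well under this new notion and, under previous notions of regret, comes close to the best bounds achieved by earlier algorithms with optimally tuned parameters.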