We propose randomized least-squares value iteration (RLSVI) -- a new
reinforcement learning algorithm designed to explore and generalize efficiently
via linearly parameterized value functions. We explain why versions of
least-squares value iteration that use Boltzmann or epsilon-greedy exploration
can be highly inefficient, and we present computational results that
demonstrate dramatic efficiency gains enjoyed by RLSVI. Further, we establish
an upper bound on the expected regret of RLSVI that demonstrates
near-optimality in a tabula rasa learning context. More broadly, our results
suggest that randomized value functions offer a promising approach to tackling
a critical challenge in reinforcement learning: synthesizing efficient
exploration and effective generalization.

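This paper proposes RLSVI, a new RL algorithm that explores and generalizes via linearly parameterized value functions. Compared with least-squares value iteration using Boltzmann or epsilon-greedy exploration, RLSVI achieves dramatic efficiency gains, and its expected-regret bound shows near-optimality in a tabula rasa learning setting. The results suggest that randomized value functions are a promising way to combine efficient exploration with effective generalization in reinforcement learning.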

Generalization and Exploration via Randomized Value Functions

Most provably-efficient learning algorithms introduce optimism about
poorly-understood states and actions to encourage exploration. We study an
alternative approach for efficient exploration, posterior sampling for
reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of
known duration. At the start of each episode, PSRL updates a prior distribution
over Markov decision processes and takes one sample from this posterior. PSRL
then follows the policy that is optimal for this sample during the episode. The
algorithm is conceptually simple, computationally efficient and allows an agent
to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S
\sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the
episode length and $S$ and $A$ are the cardinalities of the state and action
spaces. This bound is one of the first for an algorithm not based on optimism,
and close to the state of the art for any reinforcement learning algorithm. We
show through simulation that PSRL significantly outperforms existing algorithms
with similar regret bounds.
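
The episode loop described above can be sketched for a small tabular problem. This is a minimal illustration, not the authors' implementation: the function names `solve_finite_horizon` and `psrl`, the Dirichlet(1, ..., 1) prior on transitions, the Beta(1, 1) prior on Bernoulli reward means, and the fixed start state 0 are all assumptions made here for the example.

```python
import numpy as np


def solve_finite_horizon(P, R, tau):
    """Backward induction for a finite-horizon MDP.

    P has shape (S, A, S), R has shape (S, A).
    Returns a time-dependent greedy policy of shape (tau, S).
    """
    S = P.shape[0]
    V = np.zeros(S)
    policy = np.zeros((tau, S), dtype=int)
    for h in reversed(range(tau)):
        Q = R + P @ V                 # (S, A): reward plus expected next value
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy


def psrl(P_true, R_true, tau, n_episodes, rng):
    """Posterior sampling over tabular MDPs (illustrative assumptions only).

    P_true and R_true define the environment; R_true holds Bernoulli
    reward probabilities. Every episode starts in state 0 (an assumption).
    """
    S, A, _ = P_true.shape
    trans_counts = np.ones((S, A, S))  # Dirichlet(1, ..., 1) prior parameters
    rew_ab = np.ones((S, A, 2))        # Beta(1, 1) prior parameters
    returns = []
    for _ in range(n_episodes):
        # 1. Draw a single MDP from the current posterior.
        P = np.empty((S, A, S))
        for s in range(S):
            for a in range(A):
                P[s, a] = rng.dirichlet(trans_counts[s, a])
        R = rng.beta(rew_ab[..., 0], rew_ab[..., 1])

        # 2. Compute the policy that is optimal for the sampled MDP.
        policy = solve_finite_horizon(P, R, tau)

        # 3. Follow that policy for the whole episode, updating the posterior.
        s, total = 0, 0.0
        for h in range(tau):
            a = policy[h, s]
            r = float(rng.random() < R_true[s, a])
            s_next = rng.choice(S, p=P_true[s, a])
            trans_counts[s, a, s_next] += 1
            rew_ab[s, a, 0] += r
            rew_ab[s, a, 1] += 1.0 - r
            total += r
            s = s_next
        returns.append(total)
    return np.array(returns), trans_counts
```

A call might look like `psrl(P_true, R_true, tau=3, n_episodes=20, rng=np.random.default_rng(0))`. Conjugate priors keep the posterior update to a count increment, which is why each episode costs little beyond one pass of backward induction; a different prior would change step 1 but not the overall structure.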

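This work studies posterior sampling for reinforcement learning (PSRL). At the start of each episode of known length, PSRL updates its distribution over Markov decision processes, draws one sample from the posterior, and follows the policy that is optimal for that sample, which yields efficient exploration. Its expected regret is bounded by $\tilde{O}(\tau S \sqrt{AT})$ in terms of elapsed time, episode length, and the sizes of the state and action spaces, and it lets an agent encode prior knowledge in a natural way.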