As the impact of real-time processing has become apparent in recent years,
the need for efficient implementations of reinforcement learning (RL)
algorithms has grown. Despite the numerous advantages of the Bellman equations
used in RL algorithms, they come with a large search space of design
parameters.
This research aims to shed light on the design space exploration associated
with reinforcement learning parameters, specifically those of Policy Iteration.
Given the high computational cost of fine-tuning the parameters of
reinforcement learning algorithms, we propose an auto-tuner-based ordinal
regression approach to accelerate the exploration of these parameters and, in
turn, accelerate convergence towards an optimal policy. Our approach achieves
a peak speedup of 1.82x and an average speedup of 1.48x over the previous
state-of-the-art.
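
For readers unfamiliar with Policy Iteration, the following is a minimal sketch on a small deterministic grid world, the kind of path-planning task the title suggests. It is not the authors' implementation and does not include their auto-tuner; the grid layout, the step cost of -1, and the names `policy_iteration`, `extract_path`, `gamma`, and `theta` are assumptions made for illustration. The discount factor `gamma` and the evaluation threshold `theta` are examples of design parameters whose values change how many sweeps the algorithm needs.

```python
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action, n_rows, n_cols, blocked):
    """Deterministic move; hitting a wall or obstacle leaves the agent in place."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if not (0 <= r < n_rows and 0 <= c < n_cols) or (r, c) in blocked:
        return state
    return (r, c)

def policy_iteration(n_rows, n_cols, goal, blocked=frozenset(), gamma=0.9, theta=1e-6):
    """Return (policy, values, sweeps) for a grid with step cost -1 and a terminal goal."""
    states = [(r, c) for r in range(n_rows) for c in range(n_cols)
              if (r, c) != goal and (r, c) not in blocked]
    V = {(r, c): 0.0 for r in range(n_rows) for c in range(n_cols)}
    policy = {s: 0 for s in states}  # hypothetical starting policy: always move up
    sweeps = 0
    while True:
        # Policy evaluation: apply the Bellman expectation backup until values settle.
        while True:
            delta = 0.0
            for s in states:
                v_new = -1.0 + gamma * V[step(s, policy[s], n_rows, n_cols, blocked)]
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            sweeps += 1
            if delta < theta:
                break
        # Policy improvement: switch to a strictly better action where one exists.
        stable = True
        for s in states:
            q = [-1.0 + gamma * V[step(s, a, n_rows, n_cols, blocked)] for a in range(4)]
            best = max(range(4), key=lambda a: q[a])
            if q[best] > q[policy[s]] + 1e-9:
                policy[s], stable = best, False
        if stable:
            return policy, V, sweeps

def extract_path(policy, start, goal, n_rows, n_cols, blocked=frozenset()):
    """Follow the greedy policy from start until the goal (or a step limit) is reached."""
    path = [start]
    while path[-1] != goal and len(path) <= n_rows * n_cols:
        path.append(step(path[-1], policy[path[-1]], n_rows, n_cols, blocked))
    return path

blocked = frozenset({(1, 1), (2, 1)})
policy, values, sweeps = policy_iteration(4, 4, goal=(3, 3), blocked=blocked)
path = extract_path(policy, (0, 0), (3, 3), 4, 4, blocked)
print(path, sweeps)
```

Re-running `policy_iteration` with different `gamma` or `theta` values and comparing the returned `sweeps` gives a rough feel for the parameter search that an auto-tuner would automate.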

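This study explores the design space of reinforcement learning parameters and proposes an auto-tuner-based ordinal regression method that accelerates convergence, achieving a peak speedup of 1.82x and an average speedup of 1.48x.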

Path Planning using Reinforcement Learning: A Policy Iteration Approach

Finite-horizon sequential experimental design (SED) arises naturally in many
contexts, including hyperparameter tuning in machine learning as well as more
traditional settings. Computing the optimal policy for such problems requires
solving Bellman equations, which are generally intractable. Most existing work
resorts to severely myopic approximations by limiting the decision horizon to
only a single time-step, which can underweight exploration in favor of
exploitation. We present BINOCULARS: Batch-Informed NOnmyopic Choices, Using
Long-horizons for Adaptive, Rapid SED, a general framework for deriving
efficient, nonmyopic approximations to the optimal experimental policy. Our key
idea is simple and surprisingly effective: we first compute a one-step optimal
batch of experiments, then select a single point from this batch to evaluate.
We realize BINOCULARS for Bayesian optimization and Bayesian quadrature -- two
notable SED problems with radically different objectives -- and demonstrate
that BINOCULARS significantly outperforms myopic alternatives in real-world
scenarios.
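
The two-step idea (compute a one-step optimal batch, then evaluate a single point from it) can be illustrated with a small Bayesian optimization sketch. This is not the authors' code: the Gaussian process with an RBF kernel, the Monte Carlo estimate of batch expected improvement, the exhaustive search over pairs of grid candidates, and the rule of picking the batch member with the highest single-point expected improvement are all assumptions chosen to keep the example short. The names `gp_posterior`, `batch_ei`, and `binoculars_step` are hypothetical.

```python
import itertools
import numpy as np

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Posterior mean and covariance of a zero-mean GP at the query points."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_query)
    solved = np.linalg.solve(K, Ks)
    mean = solved.T @ y_obs
    cov = rbf(x_query, x_query) - Ks.T @ solved
    return mean, cov

def batch_ei(mean, cov, best, rng, n_samples=1000):
    """Monte Carlo estimate of the expected improvement of a batch (maximization)."""
    cov = cov + 1e-9 * np.eye(len(mean))
    samples = rng.multivariate_normal(mean, cov, size=n_samples, check_valid="ignore")
    return np.maximum(samples.max(axis=1) - best, 0.0).mean()

def binoculars_step(x_obs, y_obs, candidates, batch_size=2, seed=0):
    """Pick the next point: best batch on the candidate grid, then one member of it."""
    rng = np.random.default_rng(seed)
    best = y_obs.max()
    mean, cov = gp_posterior(x_obs, y_obs, candidates)
    # Step 1: approximate the one-step optimal batch by scoring every candidate subset.
    best_batch, best_value = None, -np.inf
    for idx in itertools.combinations(range(len(candidates)), batch_size):
        idx = list(idx)
        value = batch_ei(mean[idx], cov[np.ix_(idx, idx)], best, rng)
        if value > best_value:
            best_batch, best_value = idx, value
    # Step 2 (assumed selection rule): keep the batch member with the highest
    # single-point expected improvement.
    single = [batch_ei(mean[[i]], cov[np.ix_([i], [i])], best, rng) for i in best_batch]
    chosen = best_batch[int(np.argmax(single))]
    return candidates[chosen], candidates[best_batch]

x_obs = np.array([0.15, 0.5, 0.85])
y_obs = np.sin(6.0 * x_obs)  # toy objective values, for illustration only
candidates = np.linspace(0.0, 1.0, 21)
chosen, batch = binoculars_step(x_obs, y_obs, candidates)
print("batch:", batch, "chosen:", chosen)
```

Scoring the batch jointly rewards spreading points across promising regions, so the selected point reflects some look-ahead even though only one evaluation is made per step.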

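This work proposes BINOCULARS, a new framework for sequential experimental design, realized for Bayesian optimization and Bayesian quadrature, that approximates the optimal experimental policy more efficiently and more accurately than myopic alternatives.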

BINOCULARS for Efficient, Nonmyopic Sequential Experimental Design

We study the online estimation of the optimal policy of a Markov decision
process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods which
exploit the inherent minimax duality of Bellman equations. The SPD methods
update a few coordinates of the value and policy estimates as a new state
transition is observed. These methods require little storage and have low
computational complexity per iteration. The SPD methods find an
absolute-$\epsilon$-optimal policy, with high probability, using
$\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2\sigma^2
}{(1-\gamma)^6\epsilon^2} \right)$ iterations/samples for the infinite-horizon
discounted-reward MDP and $\mathcal{O}\left(\frac{|\mathcal{S}|^4
|\mathcal{A}|^2H^6\sigma^2 }{\epsilon^2} \right)$ for the finite-horizon MDP.
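
The minimax duality mentioned above can be made concrete with the standard linear-programming view of the Bellman equation. The following is a sketch for the discounted case under assumed notation: the reward vectors $r_a$, transition matrices $P_a$, positive weight vector $\xi$, and multipliers $\lambda_a$ are not defined in the abstract, and the paper's exact formulation may differ. The optimal value function satisfies

$$
v^*(s) = \max_{a \in \mathcal{A}} \Big\{ r_a(s) + \gamma \sum_{s' \in \mathcal{S}} P_a(s, s')\, v^*(s') \Big\},
$$

and it is the solution of the linear program

$$
\min_{v} \ (1-\gamma)\, \xi^\top v \quad \text{s.t.} \quad (I - \gamma P_a)\, v - r_a \ge 0 \quad \forall a \in \mathcal{A}.
$$

Attaching a multiplier $\lambda_a \ge 0$ to each constraint gives the saddle-point problem

$$
\min_{v} \ \max_{\lambda \ge 0} \ (1-\gamma)\, \xi^\top v + \sum_{a \in \mathcal{A}} \lambda_a^\top \big( (\gamma P_a - I)\, v + r_a \big).
$$

The objective is linear in $v$ for fixed $\lambda$ and linear in $\lambda$ for fixed $v$, which is what allows cheap coordinate-wise stochastic updates of both. The dual variable $\lambda$ can be read as an unnormalized state-action occupancy, from which a policy estimate follows.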

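This paper studies online estimation of the optimal policy of a Markov decision process (MDP) and proposes a class of stochastic primal-dual (SPD) methods that exploit the inherent minimax duality of the Bellman equations. The methods require little storage and have low per-iteration computational complexity, updating only a few coordinates of the value and policy estimates as each new state transition is observed. For the infinite-horizon discounted-reward MDP, the SPD methods find an absolute-$\epsilon$-optimal policy with high probability using $\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2\sigma^2}{(1-\gamma)^6\epsilon^2}\right)$ iterations/samples; for the finite-horizon MDP, the count is $\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2H^6\sigma^2}{\epsilon^2}\right)$.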