We study pure exploration with infinitely many bandit arms generated i.i.d.
from an unknown distribution. Our goal is to efficiently select a single
high-quality arm whose average reward is, with probability $1-\delta$, within
$\varepsilon$ of being among the top $\eta$-fraction of arms; this is a natural
adaptation of the classical PAC guarantee for infinite action sets. We consider
both the fixed confidence and fixed budget settings, aiming respectively for
minimal expected and fixed sample complexity.
For fixed confidence, we give an algorithm with expected sample complexity
$O\left(\frac{\log (1/\eta)\log (1/\delta)}{\eta\varepsilon^2}\right)$. This is
optimal except for the $\log (1/\eta)$ factor, and the $\delta$-dependence
closes a quadratic gap in the literature. For fixed budget, we show the
asymptotically optimal sample complexity as $\delta\to 0$ is
$c^{-1}\log(1/\delta)\big(\log\log(1/\delta)\big)^2$ to leading order.
Equivalently, the optimal failure probability given exactly $N$ samples decays
as $\exp\big(-cN/\log^2 N\big)$, up to a factor $1\pm o_N(1)$ inside the
exponent. The constant $c$ depends explicitly on the problem parameters
(including the unknown arm distribution) through a certain Fisher information
distance. Even the strictly super-linear dependence on $\log(1/\delta)$ was not
previously known, and establishing it resolves a question of Grossman and
Moshkovitz (FOCS 2016, SIAM Journal on Computing 2020).
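
The equivalence between the two fixed budget statements can be checked with a short calculation. This is only a sketch: it assumes natural logarithms, treats $c$ as a fixed constant, and ignores the $1\pm o_N(1)$ factor inside the exponent. Setting $\delta = \exp\big(-cN/\log^2 N\big)$ gives

$$\log(1/\delta) = \frac{cN}{\log^2 N}, \qquad \log\log(1/\delta) = \log N - 2\log\log N + \log c = \big(1+o_N(1)\big)\log N,$$

and therefore

$$c^{-1}\log(1/\delta)\big(\log\log(1/\delta)\big)^2 = \big(1+o_N(1)\big)\,N,$$

so a budget of $N$ samples corresponds to the stated sample complexity to leading order.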

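This paper studies pure exploration in bandit problems with infinitely many arms. For the fixed confidence and fixed budget settings, it proposes two algorithms that aim respectively for minimal expected and fixed sample complexity. Each selects a single high-quality arm whose average reward is within $\varepsilon$ of being among the top $\eta$-fraction of arms, and the guarantees are proved theoretically.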
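
To make the $(\varepsilon, \eta, \delta)$ guarantee concrete, here is a minimal sketch of a naive baseline. It is not the algorithm from the paper. It assumes rewards in $[0, 1]$, and for the demo it assumes Bernoulli arms whose means are drawn uniformly from $[0, 1]$. The function names `sample_sizes` and `naive_select` are hypothetical. The idea: draw enough arms that one lands in the top $\eta$-fraction with probability at least $1-\delta/2$, then pull every arm often enough (Hoeffding's inequality plus a union bound) that all empirical means are within $\varepsilon/2$ of the truth with probability at least $1-\delta/2$, and return the empirically best arm.

```python
import math
import random


def sample_sizes(epsilon, eta, delta):
    """Arm count and pulls per arm for the naive scheme (rewards in [0, 1])."""
    # (1 - eta)^n <= exp(-eta * n) <= delta / 2, so with probability at least
    # 1 - delta / 2 some drawn arm lies in the top eta-fraction.
    n_arms = math.ceil(math.log(2 / delta) / eta)
    # Hoeffding plus a union bound over the arms: every empirical mean is
    # within epsilon / 2 of its true mean, except with probability delta / 2.
    pulls = math.ceil(2 / epsilon**2 * math.log(4 * n_arms / delta))
    return n_arms, pulls


def naive_select(means, pulls, rng):
    """Pull each Bernoulli arm `pulls` times; return the empirically best index.

    `means` holds the hidden arm means and is used only to simulate rewards.
    """
    def empirical_mean(mu):
        return sum(rng.random() < mu for _ in range(pulls)) / pulls

    averages = [empirical_mean(mu) for mu in means]
    return max(range(len(means)), key=averages.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    epsilon, eta, delta = 0.1, 0.1, 0.05
    n_arms, pulls = sample_sizes(epsilon, eta, delta)
    # Assumed reservoir for the demo: arm means drawn i.i.d. from Uniform[0, 1].
    means = [rng.random() for _ in range(n_arms)]
    chosen = naive_select(means, pulls, rng)
    print(n_arms * pulls, "pulls; chosen arm mean:", round(means[chosen], 3))
```

The total number of pulls in this baseline is `n_arms * pulls`, which grows like $\log^2(1/\delta)$ as $\delta \to 0$ for fixed $\eta$ and $\varepsilon$, whereas the fixed confidence bound stated in the abstract is linear in $\log(1/\delta)$.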

Asymptotically Optimal Pure Exploration for Infinite-Armed Bandits

Recently, there has been significant progress in understanding reinforcement
learning in discounted infinite-horizon Markov decision processes (MDPs) by
deriving tight sample complexity bounds. However, in many real-world
applications, an interactive learning agent operates for a fixed or bounded
period of time, for example tutoring students for exams or handling customer
service requests. Such scenarios can often be better treated as episodic
fixed-horizon MDPs, for which only looser bounds on the sample complexity
exist. A natural notion of sample complexity in this setting is the number of
episodes required to guarantee a certain performance with high probability (PAC
guarantee). In this paper, we derive an upper PAC bound $\tilde
O(\frac{|\mathcal S|^2 |\mathcal A| H^2}{\epsilon^2} \ln\frac 1 \delta)$ and a
lower PAC bound $\tilde \Omega(\frac{|\mathcal S| |\mathcal A| H^2}{\epsilon^2}
\ln \frac 1 {\delta + c})$ that match up to logarithmic terms and an additional
linear dependence on the number of states $|\mathcal S|$. The lower bound is the
first of its kind for this setting. Our upper bound leverages Bernstein's
inequality to improve on previous bounds for episodic finite-horizon MDPs, which
have a time-horizon dependence of at least $H^3$.

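This paper studies the performance of an interactive learning agent that operates over a fixed period of time. From the perspective of sample complexity, it derives upper and lower PAC bounds, providing theoretical support for the study of fixed-horizon MDPs.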