Many interesting real-world domains involve reinforcement learning (RL) in
partially observable environments. Efficient learning in such domains is
important, but existing sample complexity bounds for partially observable RL
are at least exponential in the episode length. We give, to our knowledge, the
first partially observable RL algorithm with a polynomial bound on the number
of episodes on which the algorithm may not achieve near-optimal performance.
Our algorithm is suitable for an important class of episodic POMDPs. Our
approach builds on recent advances in the method of moments for latent
variable model estimation.

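This paper studies reinforcement learning in partially observable environments and proposes the first algorithm with a polynomial bound for an important class of POMDPs; the algorithm builds on recent method-of-moments techniques for estimating latent variable models.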

A PAC RL Algorithm for Episodic POMDPs

Recently, there has been significant progress in understanding reinforcement
learning in discounted infinite-horizon Markov decision processes (MDPs) by
deriving tight sample complexity bounds. However, in many real-world
applications, an interactive learning agent operates for a fixed or bounded
period of time, for example tutoring students for exams or handling customer
service requests. Such scenarios can often be better treated as episodic
fixed-horizon MDPs, for which only looser bounds on the sample complexity
exist. A natural notion of sample complexity in this setting is the number of
episodes required to guarantee a certain performance with high probability (PAC
guarantee). In this paper, we derive an upper PAC bound $\tilde
O(\frac{|\mathcal S|^2 |\mathcal A| H^2}{\epsilon^2} \ln\frac 1 \delta)$ and a
lower PAC bound $\tilde \Omega(\frac{|\mathcal S| |\mathcal A| H^2}{\epsilon^2}
\ln \frac 1 {\delta + c})$ that match up to log-terms and an additional linear
dependency on the number of states $|\mathcal S|$. The lower bound is the first
of its kind for this setting. Our upper bound leverages Bernstein's inequality
to improve on previous bounds for episodic finite-horizon MDPs, which have a
time-horizon dependency of at least $H^3$.
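
As a rough illustration of how the two bounds above scale, the following Python sketch evaluates only their leading terms. The function names, the example values of $|\mathcal S|$, $|\mathcal A|$, $H$, $\epsilon$ and $\delta$, and the choice $c = 0$ are assumptions made for illustration. The constants and logarithmic factors hidden by the $\tilde O$ and $\tilde \Omega$ notation are dropped, so the outputs are not actual episode counts.

```python
import math


def upper_bound_leading_term(num_states, num_actions, horizon, epsilon, delta):
    """Leading term of the upper bound: |S|^2 |A| H^2 / eps^2 * ln(1/delta).

    Constants and log factors hidden by the O-tilde notation are omitted.
    """
    return (num_states ** 2 * num_actions * horizon ** 2 / epsilon ** 2) * math.log(1.0 / delta)


def lower_bound_leading_term(num_states, num_actions, horizon, epsilon, delta, c=0.0):
    """Leading term of the lower bound: |S| |A| H^2 / eps^2 * ln(1/(delta + c)).

    The abstract does not give a value for c; c = 0 is an assumption
    made purely for illustration.
    """
    return (num_states * num_actions * horizon ** 2 / epsilon ** 2) * math.log(1.0 / (delta + c))


# Hypothetical problem sizes, chosen only for illustration.
S, A, H = 10, 4, 20
eps, delta = 0.1, 0.05

upper = upper_bound_leading_term(S, A, H, eps, delta)
lower = lower_bound_leading_term(S, A, H, eps, delta)

# With c = 0 the leading terms differ by exactly one factor of |S|.
gap = upper / lower
print(f"upper ~ {upper:.3e}, lower ~ {lower:.3e}, ratio = {gap:.1f}")
```

Under these assumptions the ratio of the two leading terms is the number of states, which is the "additional linear dependency on $|\mathcal S|$" mentioned above.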

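This paper studies interactive learning agents that operate over a fixed time horizon and derives upper and lower PAC bounds on sample complexity, providing theoretical support for the study of episodic fixed-horizon MDPs.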

Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning

We derive fundamental sample complexity bounds for recovering sparse and
structured signals for linear and nonlinear observation models including sparse
regression, group testing, multivariate regression and problems with missing
features. In general, sparse signal processing problems can be characterized in
terms of the following Markovian property. We are given a set of $N$ variables
$X_1,X_2,\ldots,X_N$, and there is an unknown subset of variables $S \subset
\{1,\ldots,N\}$ that are relevant for predicting outcomes $Y$. More
specifically, when $Y$ is conditioned on $\{X_n\}_{n\in S}$ it is conditionally
independent of the other variables, $\{X_n\}_{n \not \in S}$. Our goal is to
identify the set $S$ from samples of the variables $X$ and the associated
outcomes $Y$. We characterize this problem as a version of the noisy channel
coding problem. Using asymptotic information-theoretic analyses, we establish
mutual information formulas that provide sufficient and necessary conditions on
the number of samples required to successfully recover the salient variables.
These mutual information expressions unify conditions for both linear and
nonlinear observations. We then compute sample complexity bounds for the
aforementioned models based on these mutual information expressions,
demonstrating the applicability and flexibility of our results for general
sparse signal processing models.
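
To make the problem setup concrete, here is a minimal Python sketch of noiseless group testing, one of the nonlinear observation models listed above. Each test pools a random subset of the $N$ items (the variables $X_n$ indicate pool membership), and the outcome $Y$ is positive exactly when the pool contains an item from the unknown set $S$, so $Y$ depends on $X$ only through $\{X_n\}_{n\in S}$. The function names, problem sizes, pooling probability, and the simple elimination decoder are assumptions chosen for illustration. They are not the information-theoretic analysis described in the abstract.

```python
import random


def run_group_tests(num_items, defective, num_tests, pool_prob, rng):
    """Simulate noiseless group tests.

    Each item joins a pool independently with probability pool_prob.
    A test is positive iff its pool contains at least one defective item.
    """
    pools, outcomes = [], []
    for _ in range(num_tests):
        pool = {i for i in range(num_items) if rng.random() < pool_prob}
        pools.append(pool)
        outcomes.append(bool(pool & defective))
    return pools, outcomes


def eliminate_non_defectives(num_items, pools, outcomes):
    """Return the items that never appear in a negative test.

    In the noiseless setting a defective item can never be eliminated,
    so the result always contains the true set S (possibly with extras).
    """
    candidates = set(range(num_items))
    for pool, positive in zip(pools, outcomes):
        if not positive:
            candidates -= pool
    return candidates


# Hypothetical instance: N = 50 items, S = {3, 17}, 40 tests.
rng = random.Random(0)
defective = {3, 17}
pools, outcomes = run_group_tests(50, defective, 40, 0.3, rng)
candidates = eliminate_non_defectives(50, pools, outcomes)
print(f"{len(candidates)} candidate items remain; true set included: {defective <= candidates}")
```

Increasing the number of tests shrinks the candidate set toward $S$. How many samples are needed for this to succeed is the kind of question the sample complexity bounds address.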

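In this paper, we use asymptotic information-theoretic analysis to derive fundamental sample complexity bounds for recovering sparse and structured signals under linear and nonlinear observation models, including sparse regression, group testing, multivariate regression, and problems with missing features, yielding sufficient and necessary conditions for general sparse signal processing models.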