We propose and study a new model for reinforcement learning with rich
observations, generalizing contextual bandits to sequential decision making.
These models require an agent to take actions based on observations (features)
with the goal of achieving long-term performance competitive with a large set
of policies. To avoid barriers to sample-efficient learning associated with
large observation spaces and general POMDPs, we focus on problems that can be
summarized by a small number of hidden states and have long-term rewards that
are predictable by a reactive function class. In this setting, we design and
analyze a new reinforcement learning algorithm, Least Squares Value Elimination
by Exploration. We prove that the algorithm learns near-optimal behavior after
a number of episodes that is polynomial in all relevant parameters, logarithmic
in the number of policies, and independent of the size of the observation
space. Our result provides theoretical justification for reinforcement learning
with function approximation.

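This work proposes a new reinforcement learning model that generalizes contextual bandits to sequential decision making. By analyzing the Least Squares Value Elimination by Exploration algorithm, it shows that, in certain settings, near-optimal behavior can be learned after a polynomial number of episodes.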