Designing sample-efficient and computationally feasible reinforcement learning (RL) algorithms is particularly challenging in environments with large or infinite state and action spaces. In this paper, we advance this effort by presenting an efficient algorithm for Markov Decision Processes (MDPs) where the state-action value function of any policy is linear in a given feature map. This challenging setting can model environments with infinite states and actions, strictly generalizes classic linear MDPs, and currently lacks a computationally efficient algorithm under online access to the MDP. Specifically, we introduce a new RL algorithm that efficiently finds a near-optimal policy in this setting, using a number of episodes and calls to a cost-sensitive classification (CSC) oracle that are both polynomial in the problem parameters. Notably, our CSC oracle can be efficiently implemented when the feature dimension is constant, representing a clear improvement over state-of-the-art methods, which require solving non-convex problems with horizon-many variables and can incur computational costs that are \emph{exponential} in the horizon.
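
For concreteness, the linearity assumption above is commonly formalized as follows. The notation is an assumption made here for illustration and is not taken from the paper: $\phi$ is the given feature map, $d$ its dimension, $H$ the horizon, and $\theta_h^{\pi}$ a policy-dependent weight vector.

\[
\forall \pi,\ \forall h \in \{1,\dots,H\},\ \exists\, \theta_h^{\pi} \in \mathbb{R}^d : \quad Q_h^{\pi}(s,a) = \langle \phi(s,a),\, \theta_h^{\pi} \rangle \quad \text{for all } (s,a).
\]

Linear MDPs satisfy this condition because their rewards and transitions are themselves linear in $\phi$, while the converse need not hold. A CSC oracle, in its generic form (the paper's exact oracle may differ), takes contexts $x_i$ with cost functions $c_i : \mathcal{A} \to \mathbb{R}$ and a policy class $\Pi$, and returns

\[
\hat{\pi} \in \operatorname*{arg\,min}_{\pi \in \Pi} \sum_{i=1}^{n} c_i\bigl(\pi(x_i)\bigr).
\]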

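This work addresses the challenge of designing sample-efficient and computationally tractable reinforcement learning algorithms for large or infinite state and action spaces. We propose a new algorithm that efficiently finds a near-optimal policy under a given feature map, using a number of samples and calls to a cost-sensitive classification oracle that is polynomial in the problem parameters. The algorithm substantially improves on existing methods, particularly in environments with infinite states and actions, and has significant potential for applications.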

Sample and Oracle Efficient Reinforcement Learning for MDPs with Linearly-Realizable Value Functions