We consider approximate dynamic programming in $\gamma$-discounted Markov
decision processes and apply it to approximate planning with linear
value-function approximation. Our first contribution is a new variant of
Approximate Policy Iteration (API), called Confident Approximate Policy
Iteration (CAPI), which computes a deterministic stationary policy with an
optimal error bound scaling linearly with the product of the effective horizon
$H$ and the worst-case approximation error $\epsilon$ of the action-value
functions of stationary policies. This improvement over API (whose error scales
with $H^2$) comes at the price of an $H$-fold increase in memory cost. Unlike
Scherrer and Lesner [2012], who recommended computing a non-stationary policy
to achieve a similar improvement (with the same memory overhead), we retain
stationary policies throughout. This enables our second contribution, the
application of CAPI to planning with local access to a simulator and
$d$-dimensional linear function approximation. Specifically, we design a planning
algorithm that applies CAPI to obtain a sequence of policies with successively
refined accuracies on a dynamically evolving set of states. The algorithm
outputs an $\tilde O(\sqrt{d}H\epsilon)$-optimal policy after issuing $\tilde
O(dH^4/\epsilon^2)$ queries to the simulator, simultaneously achieving the
optimal accuracy bound and the best known query complexity bound, while earlier
algorithms in the literature achieve only one of them. This query complexity is
shown to be tight in all parameters except $H$. These improvements come at the
expense of a mild (polynomial) increase in memory and computational costs of
both the algorithm and its output policy.

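Summary: The paper proposes a new approximate dynamic programming algorithm, Confident Approximate Policy Iteration (CAPI), and applies it to planning with local access to a simulator. The planner uses CAPI to produce a sequence of policies of successively refined accuracy and outputs an $\tilde O(\sqrt{d}H\epsilon)$-optimal policy, at the cost of a mild (polynomial) increase in memory and computation. It achieves the optimal accuracy bound and the best known query complexity bound simultaneously, whereas earlier algorithms achieve only one of the two.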