We propose a novel randomized linear programming algorithm for approximating
the optimal policy of the discounted Markov decision problem. By leveraging the
value-policy duality and binary-tree data structures, the algorithm adaptively
samples state-action-state transitions and makes exponentiated primal-dual
updates. We show that it finds an $\epsilon$-optimal policy using nearly linear
run time in the worst case. When the Markov decision process is ergodic and
specified in some special data formats, the algorithm finds an
$\epsilon$-optimal policy using run time linear in the total number of
state-action pairs, which is sublinear in the input size. These results provide
a new avenue and complexity benchmarks for solving stochastic dynamic programs.
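
To make the role of the binary-tree data structure concrete, the following is a minimal sketch of a sum tree: a complete binary tree over nonnegative weights that supports drawing an index with probability proportional to its weight, and changing a single weight, each in time logarithmic in the number of leaves. The abstract does not spell out the data structure, so this is an illustration of the general technique under our own assumptions rather than the algorithm's actual implementation. The names `SumTree` and `exponentiated_update` and the `step` parameter are hypothetical.

```python
import math
import random


class SumTree:
    """Complete binary tree: leaves hold nonnegative weights,
    internal nodes hold the sum of their two children."""

    def __init__(self, weights):
        self.n = len(weights)
        # Round the leaf count up to a power of two so the tree is complete.
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = float(w)
        # Fill internal nodes bottom-up; node k has children 2k and 2k + 1.
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def total(self):
        return self.tree[1]

    def get(self, i):
        return self.tree[self.size + i]

    def update(self, i, weight):
        """Set leaf i to `weight` and refresh the sums on its path to the root."""
        node = self.size + i
        self.tree[node] = float(weight)
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def find(self, u):
        """Return the leaf whose cumulative-weight interval contains u."""
        node = 1
        while node < self.size:
            left = 2 * node
            if u < self.tree[left]:
                node = left
            else:
                u -= self.tree[left]
                node = left + 1
        # Clamp in case rounding pushes u onto a zero-weight padding leaf.
        return min(node - self.size, self.n - 1)

    def sample(self, rng=random):
        """Draw index i with probability weight[i] / total()."""
        return self.find(rng.random() * self.total())


def exponentiated_update(tree, i, step):
    """Hypothetical multiplicative update: scale weight i by exp(step)."""
    tree.update(i, tree.get(i) * math.exp(step))
```

In a primal-dual scheme of the kind the abstract describes, one would presumably keep the unnormalized dual weights over state-action pairs in such a tree, call `sample()` to pick the next pair, and call `exponentiated_update` on the sampled entry, so that neither step needs a pass over all weights.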
