In this paper we consider the problem of obtaining sharp bounds for the
performance of temporal difference (TD) methods with linear function
approximation for policy evaluation in discounted Markov decision processes. We
show that a simple algorithm with a universal and instance-independent step
size together with Polyak-Ruppert tail averaging is sufficient to obtain
near-optimal variance and bias terms. We also provide the respective sample
complexity bounds. Our proof technique is based on refined error bounds for
linear stochastic approximation together with a novel stability result for
the product of random matrices that arise from the TD-type recurrence.
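
To make the setting concrete, the following is a minimal sketch of TD(0) with linear features, a constant step size, and Polyak-Ruppert averaging over the second half of the iterates. The function name `td0_tail_averaged`, the choice of averaging the last half, and the step size passed in as `alpha` are illustrative assumptions; the abstract does not state the paper's specific step size or averaging window, so this is not a reproduction of its algorithm.

```python
import numpy as np

def td0_tail_averaged(P, r, Phi, gamma, alpha, n_steps, seed=0):
    """TD(0) with linear features and tail averaging (illustrative sketch).

    P:   (S, S) transition matrix of the Markov chain induced by the policy
    r:   (S,) expected reward in each state
    Phi: (S, d) feature matrix, one row per state
    """
    rng = np.random.default_rng(seed)
    n_states, d = Phi.shape
    theta = np.zeros(d)
    theta_sum = np.zeros(d)
    tail_start = n_steps // 2          # assumption: average the last half
    s = rng.integers(n_states)
    for k in range(n_steps):
        s_next = rng.choice(n_states, p=P[s])
        # Temporal difference error for the observed transition s -> s_next
        td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta = theta + alpha * td_error * Phi[s]
        if k >= tail_start:
            theta_sum += theta
        s = s_next
    return theta_sum / (n_steps - tail_start)
```

With tabular features (`Phi = np.eye(S)`) the averaged parameter can be compared directly with the exact value function obtained from `np.linalg.solve(np.eye(S) - gamma * P, r)`.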

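We study performance bounds for temporal difference (TD) methods with linear function approximation for policy evaluation in discounted Markov decision processes. We show that an algorithm with a universal, instance-independent step size, combined with Polyak-Ruppert tail averaging, attains near-optimal variance and bias terms, and we give the corresponding sample complexity bounds.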

Finite-sample analysis of temporal difference learning

Finite-Sample Analysis of the Temporal Difference Learning

This paper studies risk-averse mean-variance optimization in
infinite-horizon discounted Markov decision processes (MDPs). The involved
variance metric concerns reward variability during the whole process, and
future deviations are discounted to their present values. This discounted
mean-variance optimization yields a reward function dependent on a discounted
mean, and this dependency renders traditional dynamic programming methods
inapplicable, since it breaks a crucial property: time consistency. To
deal with this unorthodox problem, we introduce a pseudo mean that transforms the
intractable MDP into a standard one with a redefined reward function,
and derive a discounted mean-variance performance difference formula. With
the pseudo mean, we propose a unified algorithm framework with a bilevel
optimization structure for the discounted mean-variance optimization. The
framework unifies a variety of algorithms for several variance-related problems
including, but not limited to, risk-averse variance and mean-variance
optimizations in discounted and average MDPs. Furthermore, the proposed
framework supplies convergence analyses that are missing from the
literature. Taking value iteration as an example, we develop a
discounted mean-variance value iteration algorithm and prove its convergence to
a local optimum with the aid of a Bellman local-optimality equation. Finally,
we conduct a numerical experiment on portfolio management to validate the
proposed algorithm.
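
The bilevel idea can be illustrated with a small alternating scheme: for a fixed pseudo mean `y`, solve a standard discounted MDP with a redefined reward, then update `y` from the resulting policy. The sketch below rests on assumptions that the abstract does not spell out: the redefined reward is taken to be `r - beta * (r - y)**2`, the outer update is `y = (1 - gamma) * (discounted mean)`, and `mu0` is a hypothetical initial state distribution. It is an illustration of the structure, not the paper's algorithm.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """Standard discounted value iteration. P: (A, S, S), R: (S, A)."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum("asj,j->sa", P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return q.argmax(axis=1), v_new
        v = v_new

def discounted_value(P, R, gamma, policy):
    """Exact discounted value of a deterministic policy under reward table R."""
    idx = np.arange(R.shape[0])
    return np.linalg.solve(np.eye(len(idx)) - gamma * P[policy, idx],
                           R[idx, policy])

def mean_variance_alternation(P, R, gamma, beta, mu0, n_outer=50):
    """Alternate between the pseudo mean y (outer) and a standard MDP (inner)."""
    y, history = 0.0, []
    for _ in range(n_outer):
        # Inner level: standard MDP with the (assumed) redefined reward
        R_tilde = R - beta * (R - y) ** 2
        policy, _ = value_iteration(P, R_tilde, gamma)
        # Outer level: pseudo mean that is optimal for the new policy
        y_new = (1 - gamma) * (mu0 @ discounted_value(P, R, gamma, policy))
        R_new = R - beta * (R - y_new) ** 2
        # Objective value of (policy, y_new), recorded to monitor progress
        history.append(mu0 @ discounted_value(P, R_new, gamma, policy))
        converged = abs(y_new - y) < 1e-12
        y = y_new
        if converged:
            break
    return policy, y, history
```

Under these assumptions each half-step cannot decrease the recorded objective, so `history` is non-decreasing up to numerical tolerance. The scheme can stop at a local rather than a global optimum, which matches the local-optimality language of the abstract.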

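A pseudo mean is used to transform the risk-averse mean-variance MDP into a standard MDP, and a unified algorithm framework with a bilevel optimization structure is proposed; the framework also supports convergence analysis. A numerical experiment validates the proposed algorithm.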

A unified algorithmic framework for mean-variance optimization in discounted Markov decision processes

A unified algorithm framework for mean-variance optimization in discounted Markov decision processes

Ye showed recently that the simplex method with Dantzig's pivoting rule, as
well as Howard's policy iteration algorithm, solve discounted Markov decision
processes (MDPs), with a constant discount factor, in strongly polynomial time.
More precisely, Ye showed that both algorithms terminate after at most
$O(\frac{mn}{1-\gamma}\log(\frac{n}{1-\gamma}))$ iterations, where $n$ is the
number of states, $m$ is the total number of actions in the MDP, and
$0<\gamma<1$ is the discount factor. We improve Ye's analysis in two respects.
First, we improve the bound given by Ye and show that Howard's policy iteration
algorithm actually terminates after at most
$O(\frac{m}{1-\gamma}\log(\frac{n}{1-\gamma}))$ iterations. Second, and more
importantly, we show that the same bound applies to the number of iterations
performed by the strategy iteration (or strategy improvement) algorithm, a
generalization of Howard's policy iteration algorithm used for solving 2-player
turn-based stochastic games with discounted zero-sum rewards. This provides the
first strongly polynomial algorithm for solving these games, resolving a
long-standing open problem.
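
For reference, Howard's policy iteration for a single-player discounted MDP can be sketched as follows: evaluate the current policy exactly, then switch every state to a greedy action whenever that strictly improves it. The array layout (`P` of shape `(A, S, S)`, `R` of shape `(S, A)`), the function name, and the tie-breaking tolerance are assumptions made for this illustration; the two-player strategy iteration algorithm discussed in the abstract is not implemented here.

```python
import numpy as np

def howard_policy_iteration(P, R, gamma, tol=1e-12):
    """Howard's policy iteration for a discounted MDP (illustrative sketch).

    P: (A, S, S) transition probabilities, R: (S, A) rewards, 0 < gamma < 1.
    Returns (policy, values, number of improvement steps).
    """
    n_states = R.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    iterations = 0
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi
        P_pi = P[policy, idx]              # row s is P[policy[s], s, :]
        r_pi = R[idx, policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Action values of every (state, action) pair under v
        q = R + gamma * np.einsum("asj,j->sa", P, v)
        # Howard's rule: switch all states that admit a strict improvement
        improvable = q.max(axis=1) > q[idx, policy] + tol
        if not improvable.any():
            return policy, v, iterations
        policy = policy.copy()
        policy[improvable] = q[improvable].argmax(axis=1)
        iterations += 1
```

The returned iteration count is the quantity bounded in the abstract by $O(\frac{m}{1-\gamma}\log(\frac{n}{1-\gamma}))$, where each iteration additionally costs one linear solve.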

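By sharpening Ye's iteration bound for Howard's policy iteration algorithm and showing that the same bound holds for the strategy iteration (strategy improvement) algorithm, this paper obtains the first strongly polynomial algorithm for 2-player turn-based stochastic games with discounted zero-sum rewards.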