Specifying a Reinforcement Learning (RL) task involves choosing a suitable
planning horizon, which is typically modeled by a discount factor. It is known
that applying RL algorithms with a lower discount factor can act as a
regularizer, improving performance in the limited data regime. Yet the exact
nature of this regularizer has not been investigated. In this work, we fill in
this gap. For several Temporal-Difference (TD) learning methods, we show an
explicit equivalence between using a reduced discount factor and adding an
explicit regularization term to the algorithm's loss. Motivated by the
equivalence, we empirically compare this technique with standard $L_2$
regularization through extensive experiments in discrete and continuous domains,
using tabular and functional representations. Our experiments suggest the
regularization effectiveness is strongly related to properties of the available
data, such as size, distribution, and mixing rate.
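
As a minimal sketch of where the discount factor enters a TD method, the following runs tabular TD(0) over a fixed batch of transitions with two different discount factors. The function name `td0_evaluate`, the toy two-state batch, and the step size are illustrative assumptions and do not come from the paper; the sketch does not reproduce the paper's equivalence result or its regularization term.

```python
import numpy as np

def td0_evaluate(transitions, n_states, gamma, alpha=0.1, sweeps=200):
    """Tabular TD(0) over a fixed batch of (s, r, s_next, done) transitions.

    `gamma` is the discount factor used by the algorithm; it may be lower
    than the discount factor of the task being evaluated.
    """
    v = np.zeros(n_states)
    for _ in range(sweeps):
        for s, r, s_next, done in transitions:
            # Bootstrapped target; no bootstrapping from a terminal transition.
            target = r + (0.0 if done else gamma * v[s_next])
            v[s] += alpha * (target - v[s])
    return v

# Hypothetical toy batch: state 0 moves to state 1 with reward 0,
# state 1 terminates with reward 1.
batch = [(0, 0.0, 1, False), (1, 1.0, 1, True)]
v_eval = td0_evaluate(batch, n_states=2, gamma=0.99)
v_reduced = td0_evaluate(batch, n_states=2, gamma=0.90)
```

In this toy batch the estimate for state 0 converges to the chosen discount factor itself, so the reduced-discount run shrinks that estimate relative to the evaluation discount.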

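This paper studies how the discount factor affects the performance of reinforcement learning algorithms, and shows experimentally that a reduced discount factor can act as a regularizer whose effectiveness depends strongly on properties of the available data, such as its size, distribution, and mixing rate.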

Discount Factor as a Regularizer in Reinforcement Learning

One of the main obstacles to broad application of reinforcement learning
methods is the parameter sensitivity of our core learning algorithms. In many
large-scale applications, online computation and function approximation
represent key strategies in scaling up reinforcement learning algorithms. In
this setting, we have effective and reasonably well understood algorithms for
adapting the learning-rate parameter, online during learning. Such
meta-learning approaches can improve robustness of learning and enable
specialization to the current task, improving learning speed. For
temporal-difference learning algorithms which we study here, there is yet
another parameter, $\lambda$, that similarly impacts learning speed and
stability in practice. Unfortunately, unlike the learning-rate parameter,
$\lambda$ parametrizes the objective function that temporal-difference methods
optimize. Different choices of $\lambda$ produce different fixed-point
solutions, and thus adapting $\lambda$ online and characterizing the
optimization is substantially more complex than adapting the learning-rate
parameter. No existing meta-learning method for $\lambda$ (1) updates
incrementally, (2) is compatible with function approximation, and (3)
maintains stable learning under both on- and off-policy sampling. In this
paper we contribute a novel objective function for optimizing $\lambda$ as a
function of state rather than time. We derive a new incremental, linear
complexity $\lambda$-adaptation algorithm that does not require offline batch
updating or access to a model of the world, and present a suite of experiments
illustrating the practicality of our new algorithm in three different settings.
Taken together, our contributions represent a concrete step towards black-box
application of temporal-difference learning methods in real-world problems.
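
To make concrete what it means for $\lambda$ to be a function of state rather than time, here is a minimal sketch of one on-policy linear TD($\lambda$) update in which the trace decay uses the $\lambda$ value of the current state. The function name `td_lambda_step` and the argument `lam_s` are hypothetical; how `lam_s` is chosen (the adaptation rule contributed by the paper) is not shown here.

```python
import numpy as np

def td_lambda_step(w, e, x, r, x_next, gamma, lam_s, alpha):
    """One on-policy linear TD(lambda) update with a state-dependent lambda.

    w: weight vector, e: eligibility trace, x / x_next: feature vectors of
    the current and next state, lam_s: trace parameter of the current state
    (assumed to be supplied by some separate adaptation rule).
    """
    e = gamma * lam_s * e + x                   # accumulating trace
    delta = r + gamma * (w @ x_next) - w @ x    # TD error
    w = w + alpha * delta * e                   # incremental, O(d) update
    return w, e

# Usage on made-up numbers: lam_s = 0 discards the old trace (TD(0) update),
# lam_s = 1 keeps it in full.
w0, e0 = np.zeros(2), np.ones(2)
x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w_a, e_a = td_lambda_step(w0, e0, x, 1.0, x_next, gamma=0.9, lam_s=0.0, alpha=0.5)
w_b, e_b = td_lambda_step(w0, e0, x, 1.0, x_next, gamma=0.9, lam_s=1.0, alpha=0.5)
```

Each step costs time linear in the number of features, which is the regime the abstract refers to as linear complexity.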

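This paper proposes a new objective function for optimizing $\lambda$ as a function of state rather than time, derives an incremental, linear-complexity $\lambda$-adaptation algorithm, and evaluates it experimentally in three different settings. These contributions are a concrete step towards applying temporal-difference learning methods to real-world problems.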

A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning

We consider LSTD($\lambda$), the least-squares temporal-difference algorithm
with eligibility traces proposed by Boyan (2002). It computes a
linear approximation of the value function of a fixed policy in a large Markov
Decision Process. Under a $\beta$-mixing assumption, we derive, for any value
of $\lambda \in (0,1)$, a high-probability estimate of the rate of convergence
of this algorithm to its limit. We deduce a high-probability bound on the error
of this algorithm, which extends (and slightly improves) the bound derived by
Lazaric et al. (2012) in the specific case where $\lambda=0$. In particular, our
analysis sheds some light on how to choose $\lambda$ with respect to the
quality of the chosen linear space and the number of samples, in a way that is
consistent with simulations.
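
For readers unfamiliar with the algorithm, the following is a minimal single-trajectory sketch of LSTD($\lambda$) with linear features. The function name `lstd_lambda`, the one-hot features, and the toy two-state cycle are illustrative assumptions; the sketch has no regularization or safeguards for a singular matrix and says nothing about the convergence rate analyzed in the paper.

```python
import numpy as np

def lstd_lambda(features, rewards, gamma, lam):
    """LSTD(lambda) on one trajectory.

    features: array of shape (T + 1, d), feature vectors of visited states.
    rewards:  array of shape (T,), reward observed on each transition.
    Returns the weight vector theta solving A theta = b.
    """
    d = features.shape[1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = features[0].copy()                       # eligibility trace
    for t in range(len(rewards)):
        A += np.outer(z, features[t] - gamma * features[t + 1])
        b += z * rewards[t]
        z = gamma * lam * z + features[t + 1]    # decay and accumulate
    return np.linalg.solve(A, b)

# Toy check: deterministic cycle 0 -> 1 -> 0 -> ... with one-hot features,
# reward 1 when leaving state 0 and reward 0 when leaving state 1.
states = np.arange(41) % 2
phi = np.eye(2)[states]
r = (states[:-1] == 0).astype(float)
theta = lstd_lambda(phi, r, gamma=0.5, lam=0.5)
```

Because the toy chain is deterministic and the features are one-hot, the returned weights coincide with the exact values V(0) = 4/3 and V(1) = 2/3 for any $\lambda$; with noisy data and a restricted linear space, the solution generally depends on $\lambda$.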

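This paper considers the LSTD($\lambda$) algorithm. Under a $\beta$-mixing assumption, it derives a high-probability estimate of the algorithm's rate of convergence for any $\lambda$, together with a high-probability bound on its error, and examines how the choice of $\lambda$ relates to the quality of the linear space and the number of samples.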