This paper provides an analysis of the tradeoff between asymptotic bias
(suboptimality with unlimited data) and overfitting (additional suboptimality
due to limited data) in the context of reinforcement learning with partial
observability. Our theoretical analysis formally shows that a smaller state
representation, while potentially increasing the asymptotic bias, decreases
the risk of overfitting. This analysis relies on expressing the
quality of a state representation by bounding L1 error terms of the associated
belief states. Theoretical results are empirically illustrated when the state
representation is a truncated history of observations, both on synthetic POMDPs
and on a large-scale POMDP in the context of smart grids, with real-world data.
Finally, similarly to known results in the fully observable setting, we also
briefly discuss and empirically illustrate how using function approximators and
adapting the discount factor may enhance the tradeoff between asymptotic bias
and overfitting in the partially observable context.
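
As an illustration of the state representation mentioned above, the following minimal sketch maps a stream of observations to truncated histories of length at most h. The function name `truncated_history_states` and the choice to keep observations only (no actions or rewards) are assumptions made for this example, not details taken from the paper.

```python
from collections import deque


def truncated_history_states(observations, h):
    """Return, for each time step, the tuple of the last h observations.

    Hypothetical helper: early steps yield shorter tuples because fewer
    than h observations are available.
    """
    window = deque(maxlen=h)  # automatically drops the oldest observation
    states = []
    for obs in observations:
        window.append(obs)
        states.append(tuple(window))
    return states


# Usage: a longer history distinguishes more situations but yields more
# distinct states to estimate from the same amount of data.
observations = [0, 1, 1, 0, 1]
short_states = truncated_history_states(observations, 1)
long_states = truncated_history_states(observations, 2)
print(len(set(short_states)), len(set(long_states)))
```

With a fixed batch of data, increasing h grows the number of distinct states, which is the mechanism behind the tradeoff: a richer representation can lower the asymptotic bias while leaving fewer samples per state.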

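By analyzing the tradeoff between asymptotic bias and overfitting under limited data, this paper studies reinforcement learning with partial observability, shows that a smaller state representation reduces the risk of overfitting, and validates this conclusion with theoretical results and experiments.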

Overfitting and asymptotic bias in batch reinforcement learning under partial observability

On overfitting and asymptotic bias in batch reinforcement learning with partial observability

Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data
sets is computationally infeasible. The recently proposed stochastic gradient
Langevin dynamics (SGLD) method circumvents this problem in three ways: it
generates proposed moves using only a subset of the data, it skips the
Metropolis-Hastings accept-reject step, and it uses sequences of decreasing
step sizes. In \cite{TehThierryVollmerSGLD2014}, we provided the mathematical
foundations for the decreasing step size SGLD, including consistency and a
central limit theorem. However, in practice the SGLD is run for a relatively
small number of iterations, and its step size is not decreased to zero. The
present article investigates the behaviour of the SGLD with fixed step size. In
particular we characterise the asymptotic bias explicitly, along with its
dependence on the step size and the variance of the stochastic gradient. On
that basis, we derive a modified SGLD that removes, up to first order in the
step size, the asymptotic bias due to the variance of the stochastic
gradients. Moreover, we obtain bounds on the finite-time bias,
variance and mean squared error (MSE). The theory is illustrated with a
Gaussian toy model for which the bias and the MSE for the estimation of moments
can be obtained explicitly. For this toy model we study the gain of the SGLD
over the standard Euler method in the limit of large data sets.
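
The fixed-step-size scheme described above can be sketched on a Gaussian toy model as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the prior N(0, 1), the likelihood N(theta, 1), and the function name `sgld_gaussian` are choices made for this example, and the code is the plain fixed-step SGLD, not the modified variant.

```python
import math
import random


def sgld_gaussian(data, batch_size, step, n_iter, seed=0):
    """Fixed-step SGLD for the mean theta of a Gaussian toy model.

    Assumed model (illustrative): theta ~ N(0, 1), x_i | theta ~ N(theta, 1).
    """
    rng = random.Random(seed)
    n_data = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_iter):
        # Use only a random subset of the data to estimate the gradient.
        batch = rng.sample(data, batch_size)
        # Gradient of the log prior plus the rescaled minibatch gradient
        # of the log likelihood (an unbiased estimate of the full gradient).
        grad = -theta + (n_data / batch_size) * sum(x - theta for x in batch)
        # Langevin step with injected Gaussian noise; there is no
        # Metropolis-Hastings accept-reject step and the step size is fixed.
        theta += 0.5 * step * grad + math.sqrt(step) * rng.gauss(0.0, 1.0)
        samples.append(theta)
    return samples
```

Under these assumptions the exact posterior is Gaussian with mean sum(data) / (N + 1) and variance 1 / (N + 1), so the empirical moments of the returned samples can be compared against it. Because the step size is not decreased to zero, the sample variance does not converge to the posterior variance, which is the kind of asymptotic bias the abstract refers to.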

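We study the behaviour and bias of the stochastic gradient Langevin dynamics (SGLD) method with a fixed step size and propose a modified SGLD that removes, to first order in the step size, the asymptotic bias caused by the variance of the stochastic gradients; we also obtain bounds on the finite-time bias, variance, and mean squared error (MSE).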