We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique for deriving (nearly) unbiased estimators, but is known to suffer from excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly to the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a closed-form solution for the case of a reproducing kernel Hilbert space (RKHS). We support our method with both theoretical and empirical analyses.
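To make the exploding-variance issue concrete, the sketch below computes the exact variance of a trajectory-wise IS weight, i.e. the product of per-step ratios between target and behavior action probabilities. It is an illustration only and not the estimator proposed in the paper: it assumes, for simplicity, that both policies ignore the state and that actions are drawn independently at each step, and the function name `is_weight_variance` and the probabilities used are hypothetical.

```python
def is_weight_variance(target, behavior, horizon):
    """Exact variance of the trajectory-wise IS weight prod_t pi(a_t) / pi0(a_t).

    Assumption (for illustration only): both policies are state-independent
    and actions are drawn i.i.d. from `behavior` at every step.
    """
    # Second moment of a single-step ratio under the behavior policy:
    # E[rho^2] = sum_a pi0(a) * (pi(a) / pi0(a))^2
    second_moment = sum(b * (p / b) ** 2 for p, b in zip(target, behavior))
    # Each single-step ratio has mean 1, so the product W also has mean 1.
    # With independent steps, E[W^2] = (E[rho^2])^horizon, hence
    # Var[W] = (E[rho^2])^horizon - 1.
    return second_moment ** horizon - 1.0


# Hypothetical two-action policies: target prefers action 0, behavior is uniform.
target_policy = [0.8, 0.2]
behavior_policy = [0.5, 0.5]

variances = {
    T: is_weight_variance(target_policy, behavior_policy, T)
    for T in (1, 10, 100)
}

for T, var in variances.items():
    print(f"horizon={T:4d}  variance of IS weight={var:.3e}")
```

Under these assumptions the variance grows geometrically with the horizon whenever the two policies differ, which is the behavior that motivates reweighting by a ratio of stationary state-visitation distributions instead of by a product over time steps.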

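This paper proposes a new off-policy estimation method that applies importance sampling directly to the stationary state-visitation distributions, thereby avoiding the exploding variance problem faced by existing estimators. Using trajectories sampled only from the behavior distribution, we develop a new method for estimating the density ratio, design a mini-max loss function for the estimation problem, and derive a closed-form solution for the RKHS case.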

Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation