Mar 2021
Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
Yihao Feng, Ziyang Tang, Na Zhang, Qiang Liu
TL;DR
This paper proposes an algorithm based on primal-dual optimization for constructing non-asymptotic confidence intervals. The approach leverages the kernel Bellman loss (KBL) of Feng et al. (2019) together with a new martingale concentration inequality that applies to time-dependent data with unknown mixing conditions, and the advantages of the algorithm are demonstrated explicitly.
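The kernel Bellman loss mentioned above weights Bellman residuals of a candidate value function by a kernel over pairs of transitions. The following is a minimal, illustrative Python sketch of such a loss under simplifying assumptions (scalar states, a deterministic target policy, an RBF kernel); the function names and signatures are hypothetical and do not reproduce the authors' implementation.

```python
import numpy as np

def kernel_bellman_loss(q, transitions, policy, gamma=0.99, bandwidth=1.0):
    """Sketch of an empirical kernel Bellman loss in the spirit of
    Feng et al. (2019).  All names, signatures, and the RBF kernel
    choice are illustrative assumptions.

    q(s, a)     -> scalar value estimate
    policy(s)   -> action of the target policy (deterministic here)
    transitions -> list of (s, a, r, s_next) tuples with scalar states
    """
    # One-sample Bellman residual of q at each observed transition.
    eps = np.array([r + gamma * q(sn, policy(sn)) - q(s, a)
                    for (s, a, r, sn) in transitions])
    states = np.array([s for (s, _, _, _) in transitions], dtype=float)
    # RBF kernel between the states of different transitions.
    K = np.exp(-((states[:, None] - states[None, :]) ** 2)
               / (2 * bandwidth ** 2))
    np.fill_diagonal(K, 0.0)  # drop i == j terms (U-statistic form)
    n = len(eps)
    # Kernel-weighted quadratic form in the residuals.
    return float(eps @ K @ eps) / (n * (n - 1))
```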
Abstract
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies. Therefore, OPE is a key step in applying reinforcement learning …
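As a concrete illustration of the OPE task described in the abstract, the sketch below estimates a target policy's expected discounted return from trajectories gathered under a different behavior policy, using plain trajectory-wise importance sampling. This is only a generic baseline for the task, not the primal-dual method proposed in the paper; all names and signatures are assumptions.

```python
import numpy as np

def importance_sampling_ope(trajectories, target_prob, behavior_prob, gamma=0.99):
    """Estimate a target policy's expected discounted return from data
    collected under a different behavior policy.  Illustrative sketch only.

    trajectories      -> list of trajectories, each a list of (s, a, r) steps
    target_prob(s, a) -> action probability under the target policy
    behavior_prob(s, a) -> action probability under the behavior policy
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Reweight by the likelihood ratio of the two policies.
            weight *= target_prob(s, a) / behavior_prob(s, a)
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```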