BriefGPT.xyz
Apr, 2023
Conformal Off-Policy Evaluation in Markov Decision Processes
Daniele Foffano, Alessio Russo, Alexandre Proutiere
TL;DR
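The paper proposes an off-policy evaluation (OPE) method based on conformal prediction. At a prescribed confidence level, it outputs an interval that contains the true reward of the target policy. It explores different ways of handling the distribution shift caused by the mismatch between the target policy and the behavior policy, and it produces shorter intervals than existing methods at the same confidence level.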
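To make the idea of a conformal interval under distribution shift concrete, here is a minimal sketch of generic weighted split conformal prediction. It is not taken from the paper. The function names, the absolute-residual score, and the use of target-to-behavior likelihood ratios as weights are all assumptions made for illustration.

```python
import numpy as np


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration scores.

    Hypothetical helper: `scores` are nonconformity scores on calibration
    data (assumed here to be absolute residuals), `weights` are assumed
    likelihood ratios between target and behavior policy, and `test_weight`
    is the ratio for the test point.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # The test point contributes a point mass at +inf.
    all_scores = np.append(scores, np.inf)
    all_weights = np.append(weights, test_weight)
    probs = all_weights / all_weights.sum()
    # Smallest score whose cumulative weighted mass reaches 1 - alpha.
    order = np.argsort(all_scores)
    cumulative = np.cumsum(probs[order])
    idx = np.searchsorted(cumulative, 1.0 - alpha, side="left")
    idx = min(idx, len(all_scores) - 1)  # guard against rounding error
    return all_scores[order][idx]


def conformal_interval(prediction, scores, weights, test_weight, alpha=0.1):
    """Symmetric interval around a point prediction of the return."""
    q = weighted_conformal_quantile(scores, weights, test_weight, alpha)
    return prediction - q, prediction + q


if __name__ == "__main__":
    # Toy calibration residuals with uniform weights (no distribution shift).
    low, high = conformal_interval(
        prediction=10.0,
        scores=[1.0, 2.0, 3.0, 4.0],
        weights=[1.0, 1.0, 1.0, 1.0],
        test_weight=1.0,
        alpha=0.3,
    )
    print(low, high)
```

With uniform weights this reduces to ordinary split conformal prediction. When the test point carries most of the weight, the quantile becomes infinite and the interval is uninformative, which is one reason interval length matters when comparing such methods.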
Abstract
Reinforcement learning aims at identifying and evaluating efficient control policies from data. In many real-world applications, the learner is not allowed to experiment and cannot gather data in an online manner ...