BriefGPT.xyz
Apr, 2021
Off-Policy Risk Assessment in Contextual Bandits
Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli
TL;DR
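The paper proposes an off-policy evaluation framework built on Lipschitz risk functionals. Its algorithm, OPRA, estimates the CDF of returns under the target policy and produces plug-in estimates for any collection of Lipschitz risks, with finite-sample guarantees that hold simultaneously over the entire class. OPRA is instantiated with both importance sampling and doubly robust estimators.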
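The importance-sampling variant described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`is_cdf`, `plug_in_mean_var`), the evaluation grid, and the assumption that the importance weights (target-policy probability divided by behavior-policy probability for each logged action) are already computed are all hypothetical choices made for illustration. The sketch estimates the CDF of rewards under the target policy, then reads two risks (mean and variance) off that single estimate.

```python
import numpy as np

def is_cdf(rewards, weights, grid):
    """Importance-weighted empirical CDF evaluated at each point of `grid`.

    `weights[i]` is assumed to be pi(a_i | x_i) / beta(a_i | x_i), the ratio of
    target to behavior policy probabilities for the i-th logged action.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # indicator[i, j] is True when rewards[i] <= grid[j]
    indicator = rewards[:, None] <= grid[None, :]
    cdf = (weights[:, None] * indicator).mean(axis=0)
    # The raw weighted estimate can exceed 1, so clip it into [0, 1].
    return np.clip(cdf, 0.0, 1.0)

def plug_in_mean_var(cdf, grid):
    """Plug-in mean and variance, treating the CDF as a distribution on `grid`.

    Assumes `grid` is sorted and covers the reward support. If the estimated
    CDF ends below 1, the masses sum to less than 1 (no renormalization here).
    """
    grid = np.asarray(grid, dtype=float)
    pmf = np.diff(cdf, prepend=0.0)  # probability mass at each grid point
    mean = np.sum(grid * pmf)
    var = np.sum((grid - mean) ** 2 * pmf)
    return mean, var

# Usage: four logged rewards with (hypothetical) importance weights.
rewards = [0.0, 1.0, 1.0, 2.0]
weights = [2.0, 0.0, 0.0, 2.0]
grid = [0.0, 1.0, 2.0]
cdf_hat = is_cdf(rewards, weights, grid)
mean_hat, var_hat = plug_in_mean_var(cdf_hat, grid)
```

Because every risk is computed from the same estimated CDF, adding another risk functional only means adding another function of `cdf_hat`; no further pass over the logged data is needed.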
Abstract
To evaluate prospective contextual bandit policies when experimentation is not possible, practitioners often rely on off-policy evaluation, using data collected under a behavioral policy. While off-policy evaluation
→