Improving Offline Contextual Bandits with Distributional Robustness

Nov, 2020

Otmane Sakhi, Louis Faury, Flavian Vasile

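TL;DR: This paper extends the distributionally robust optimization (DRO) approach to offline contextual bandits. It proposes a convex reformulation of the Counterfactual Risk Minimization principle, shows how the DRO framework can be used to build asymptotic confidence intervals for offline contextual bandits, uses known asymptotic results for robust estimators to calibrate those intervals automatically, and presents preliminary experimental results supporting the effectiveness of our approach.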

Abstract

This paper extends the distributionally robust optimization (DRO) approach for offline contextual bandits. Specifically, we leverage this framework to introduce a convex reformulation of the