Dylan J. Foster, Claudio Gentile, Mehryar Mohri, Julian Zimmert
TL;DR: We introduce a new oracle-efficient algorithm for linear contextual bandits in the infinite-action setting that, given access to a square-loss regression oracle, achieves regret bounds with optimal dependence on the level of model misspecification, allowing it to adapt to unknown misspecification.
Abstract
A major research direction in contextual bandits is to develop algorithms that are computationally efficient, yet support flexible, general-purpose function approximation. Algorithms based on modeling rewards have shown strong empirical performance, but typically require a well-specified model, and can fail when this assumption does not hold.
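To make the regression-oracle idea concrete, here is a minimal sketch of the style of reduction this line of work builds on: an online square-loss regression oracle predicts rewards, and actions are sampled by inverse-gap weighting (as in Foster and Rakhlin's SquareCB). This is an illustration under stated assumptions, not the paper's algorithm; the ridge oracle, the gamma schedule, and the simulated linear environment are all assumptions made here for the example.

import numpy as np

class OnlineRidgeOracle:
    # One possible square-loss regression oracle: online ridge regression.
    def __init__(self, dim, reg=1.0):
        self.A = reg * np.eye(dim)   # regularized Gram matrix
        self.b = np.zeros(dim)       # accumulated feature-reward products

    def predict(self, x):
        return x @ np.linalg.solve(self.A, self.b)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def inverse_gap_weights(preds, gamma):
    # SquareCB-style sampling: probability of a suboptimal action decays
    # with its predicted gap to the empirically best action.
    k = len(preds)
    best = int(np.argmax(preds))
    p = np.zeros(k)
    for a in range(k):
        if a != best:
            p[a] = 1.0 / (k + gamma * (preds[best] - preds[a]))
    p[best] = 1.0 - p.sum()  # remaining mass on the predicted best action
    return p

rng = np.random.default_rng(0)
d, k, T = 5, 10, 2000
theta = rng.normal(size=d) / np.sqrt(d)            # unknown reward parameter (simulated)
oracle = OnlineRidgeOracle(dim=d)
total_reward = 0.0
for t in range(1, T + 1):
    actions = rng.normal(size=(k, d)) / np.sqrt(d)  # context: k candidate feature vectors
    preds = np.array([oracle.predict(x) for x in actions])
    gamma = np.sqrt(k * t)                          # illustrative learning-rate schedule
    p = inverse_gap_weights(preds, gamma)
    a = rng.choice(k, p=p)
    reward = actions[a] @ theta + 0.1 * rng.normal()  # noisy linear reward
    oracle.update(actions[a], reward)
    total_reward += reward
print(f"average reward over {T} rounds: {total_reward / T:.3f}")

The appeal of this reduction is that the only learning component is the regression oracle, so any online square-loss regressor can be swapped in; the paper's contribution concerns making such guarantees degrade gracefully when the regression model is misspecified, with no prior knowledge of the misspecification level.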