Contextual Bandit Algorithms with Supervised Learning Guarantees

Feb, 2010


An Optimal High Probability Algorithm for the Contextual Bandit Problem

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, Robert E. Schapire

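TL;DR: This work studies learning in the online adversarial bandit setting and proposes two new algorithms. Exp4.P competes with a set of N experts and has been validated empirically; VE competes with an infinite policy set of VC-dimension d. Both algorithms achieve low regret and provide supervised-learning-style guarantees for the contextual bandit setting, improving on the guarantees of earlier algorithms.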

Abstract

We consider the problem of learning to predict with expert advice in an adversarial, on-line bandit setting. We study how to behave in a way that achieves nearly as much reward as the best expert with high probability, rather than in expectation. We provide the →