BriefGPT.xyz
Sep, 2020
Learning from eXtreme Bandit Feedback
Romain Lopez, Inderjit Dhillon, Michael I. Jordan
TL;DR
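This paper introduces POXM, an algorithm built on a selective importance sampler. POXM learns from bandit feedback on extreme multi-label classification (XMC) tasks by restricting the sampler to the top-p actions of the logging policy. Benchmarked against three competing methods on three XMC datasets, it significantly improves performance.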
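The idea of restricting importance sampling to the logging policy's top-p actions can be illustrated with a small sketch. This is not the paper's POXM implementation or its exact estimator: the record layout, the function names `top_p_actions` and `selective_is_estimate`, and the choice to give zero weight to logged actions outside the top-p set (without renormalizing) are all assumptions made for illustration.

```python
def top_p_actions(logging_probs, p):
    """Return the indices of the p most probable actions under the logging policy."""
    order = sorted(range(len(logging_probs)),
                   key=lambda a: logging_probs[a], reverse=True)
    return set(order[:p])


def selective_is_estimate(logs, target_policy, p):
    """Estimate the target policy's value from logged bandit feedback.

    logs: list of dicts with keys
        "context"       - whatever the target policy takes as input
        "logging_probs" - list of logging-policy probabilities, one per action
        "action"        - index of the logged action
        "reward"        - observed reward for that action
    target_policy: callable mapping a context to a list of action probabilities
    p: number of top logging-policy actions kept per context (hypothetical knob)
    """
    total = 0.0
    for record in logs:
        kept = top_p_actions(record["logging_probs"], p)
        action = record["action"]
        if action not in kept:
            # Assumption: actions outside the selected subset contribute nothing.
            continue
        target_probs = target_policy(record["context"])
        weight = target_probs[action] / record["logging_probs"][action]
        total += record["reward"] * weight
    return total / len(logs)


# Toy usage with three actions and a context-independent target policy.
logs = [
    {"context": None, "logging_probs": [0.5, 0.3, 0.2], "action": 0, "reward": 1.0},
    {"context": None, "logging_probs": [0.5, 0.3, 0.2], "action": 2, "reward": 1.0},
]
target = lambda context: [0.25, 0.25, 0.5]
estimate_top2 = selective_is_estimate(logs, target, p=2)
estimate_full = selective_is_estimate(logs, target, p=3)
```

With `p` equal to the number of actions the sketch reduces to ordinary importance sampling; a smaller `p` drops the records whose logged action had low logging probability, which are the ones that produce the largest weights.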
Abstract
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of dec…