We study the problem of learning the exploration-exploitation trade-off in
the contextual bandit problem with linear reward functions. In traditional
algorithms for the contextual bandit problem, the amount of exploration is a
parameter tuned by the user. In contrast, our proposed algorithm learns to
choose the right exploration parameter in an online manner, based on the
observed context and the immediate reward received for the chosen action. We
present two algorithms that use a bandit to find the optimal exploration
parameter of the contextual bandit algorithm, which we hope is a first step
toward the automation of the multi-armed bandit algorithm.
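
The idea of letting a bandit pick the exploration parameter can be sketched as follows. This is a minimal illustration under stated assumptions, not the two algorithms from the abstract: the base learner is a hypothetical epsilon-greedy policy with per-arm linear models fit by SGD, the outer bandit is plain UCB1 over a fixed grid of candidate epsilon values, and the outer bandit ignores the context. The names `EpsGreedyLinear`, `UCB1`, `run`, the candidate grid, and the simulated environment are all invented for the example.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class EpsGreedyLinear:
    """Hypothetical base learner: one linear reward model per arm, fit by SGD."""

    def __init__(self, n_arms, dim, lr=0.05, rng=None):
        self.theta = [[0.0] * dim for _ in range(n_arms)]
        self.lr = lr
        self.rng = rng or random.Random()

    def act(self, x, epsilon):
        # Explore uniformly with probability epsilon, otherwise act greedily.
        if self.rng.random() < epsilon:
            return self.rng.randrange(len(self.theta))
        scores = [dot(th, x) for th in self.theta]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, x, reward):
        err = reward - dot(self.theta[arm], x)
        self.theta[arm] = [t + self.lr * err * xi
                           for t, xi in zip(self.theta[arm], x)]


class UCB1:
    """Outer bandit (an assumption): each arm is a candidate epsilon."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select(self):
        for i, c in enumerate(self.counts):
            if c == 0:  # play every candidate once first
                return i
        t = sum(self.counts)
        return max(
            range(len(self.counts)),
            key=lambda i: self.means[i]
            + math.sqrt(2 * math.log(t) / self.counts[i]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def run(T=2000, seed=0):
    rng = random.Random(seed)
    candidates = [0.0, 0.05, 0.2, 0.5]           # illustrative epsilon grid
    true_theta = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # simulated environment
    learner = EpsGreedyLinear(n_arms=3, dim=2, rng=rng)
    tuner = UCB1(len(candidates))
    total = 0.0
    for _ in range(T):
        x = [rng.random(), rng.random()]         # observed context
        k = tuner.select()                       # outer bandit picks epsilon
        arm = learner.act(x, candidates[k])      # base learner picks the action
        reward = dot(true_theta[arm], x) + rng.gauss(0, 0.1)
        reward = min(1.0, max(0.0, reward))      # keep rewards in [0, 1]
        learner.update(arm, x, reward)
        tuner.update(k, reward)                  # same reward credits epsilon
        total += reward
    return total / T, tuner.counts
```

Calling `run()` returns the average reward and how often each candidate epsilon was selected. The immediate reward is used twice per round, once to update the base learner and once to credit the chosen exploration value, which mirrors the online tuning described above.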

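This paper studies how, in an online learning setting, a bandit algorithm can be used to choose the exploration parameter automatically, optimizing the balance between exploration and exploitation in contextual bandit algorithms.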

Hyper-parameter Tuning for the Contextual Bandit

We consider an adversarial variant of the classic $K$-armed linear contextual
bandit problem where the sequence of loss functions associated with each arm
is allowed to change without restriction over time. Under the assumption that
the $d$-dimensional contexts are generated i.i.d. at random from a known
distribution, we develop computationally efficient algorithms based on the
classic Exp3 algorithm. Our first algorithm, RealLinExp3, is shown to achieve a
regret guarantee of $\widetilde{O}(\sqrt{KdT})$ over $T$ rounds, which matches
the best available bound for this problem. Our second algorithm, RobustLinExp3,
is shown to be robust to misspecification, in that it achieves a regret bound
of $\widetilde{O}((Kd)^{1/3}T^{2/3}) + \varepsilon \sqrt{d} T$ if the true
reward function is linear up to an additive nonlinear error uniformly bounded
in absolute value by $\varepsilon$. To our knowledge, our performance
guarantees constitute the very first results on this problem setting.
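
For reference, the classic Exp3 algorithm that both methods build on can be sketched as follows. This is the standard non-contextual version with importance-weighted loss estimates; it is not RealLinExp3 or RobustLinExp3, whose handling of contexts is not reproduced here. The function name `exp3`, the `loss_fn` callback, and the step size used in the usage note are assumptions made for the illustration.

```python
import math
import random


def exp3(loss_fn, n_arms, T, eta, seed=0):
    """Classic Exp3 for losses in [0, 1].

    loss_fn(t, arm) returns the loss of `arm` in round t; only the loss of
    the arm actually played is ever queried (bandit feedback).
    """
    rng = random.Random(seed)
    cum_est = [0.0] * n_arms  # cumulative importance-weighted loss estimates
    total_loss = 0.0
    p = [1.0 / n_arms] * n_arms
    for t in range(T):
        # Exponential weights; subtracting the minimum avoids underflow
        # of every weight at once.
        m = min(cum_est)
        w = [math.exp(-eta * (c - m)) for c in cum_est]
        s = sum(w)
        p = [wi / s for wi in w]
        arm = rng.choices(range(n_arms), weights=p)[0]
        loss = loss_fn(t, arm)
        total_loss += loss
        # Unbiased estimate: observed loss divided by the sampling probability.
        cum_est[arm] += loss / p[arm]
    return total_loss, p
```

A possible usage, with a common textbook step size: `K, T = 3, 2000`, `eta = math.sqrt(2 * math.log(K) / (T * K))`, and `exp3(lambda t, a: 0.2 if a == 0 else 0.8, K, T, eta)`. The loss sequence passed in may change arbitrarily from round to round, which is the adversarial setting the abstract refers to.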

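For the adversarial variant of the classic $K$-armed linear contextual bandit problem, we develop computationally efficient algorithms based on Exp3: RealLinExp3, which achieves a strong regret guarantee, and RobustLinExp3, which is robust to misspecification of the linear reward function.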