In this paper, we study the multi-task structured bandit problem, where the goal
is to learn a near-optimal algorithm that minimizes cumulative regret. The
tasks share a common structure and the algorithm exploits the shared structure
to minimize the cumulative regret for an unseen but related test task. We use a
transformer as a decision-making algorithm to learn this shared structure so as
to generalize to the test task. Prior work on pretrained decision
transformers, such as DPT, requires access to the optimal action during training,
which may be hard to obtain in several scenarios. Diverging from these works, our
learning algorithm does not need knowledge of the optimal action for each task
during training; instead, it predicts a reward vector, with one entry per action,
using only the observed offline data from the diverse training tasks. At
inference time, it selects actions using these reward predictions, employing
various exploration strategies in-context for an unseen test task. Our model
outperforms other SOTA methods, such as DPT and Algorithm Distillation, over a
series of experiments on several structured bandit problems (linear, bilinear,
latent, non-linear). Interestingly, we show that our algorithm, without
knowledge of the underlying problem structure, can learn a near-optimal policy
in-context by leveraging the shared structure across diverse tasks. We further
extend the field of pretrained decision transformers by showing that they can
handle unseen tasks with new actions and still learn the underlying latent
structure to derive a near-optimal policy. We validate this across several
experiments, which show that our proposed solution is very general and has wide
applications to potentially emergent online and offline strategies at test
time. Finally, we theoretically analyze the performance of our algorithm and
obtain generalization bounds in the in-context multi-task learning setting.
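
As a rough illustration of the inference step described above, the sketch below picks an action from a vector of predicted rewards and measures cumulative regret. It is a minimal sketch, not the paper's implementation: the abstract does not name its exploration strategies, so the greedy, epsilon-greedy, and softmax rules here are assumed stand-ins, and `select_action`, `cumulative_regret`, and their parameters are hypothetical names. A plain list stands in for the transformer's reward predictions.

```python
import math
import random
from typing import List, Optional, Sequence


def select_action(predicted_rewards: Sequence[float],
                  strategy: str = "greedy",
                  epsilon: float = 0.1,
                  temperature: float = 1.0,
                  rng: Optional[random.Random] = None) -> int:
    """Pick an action index from predicted rewards (illustrative strategies only)."""
    if rng is None:
        rng = random.Random()
    n = len(predicted_rewards)
    # Index of the highest predicted reward (first one on ties).
    greedy = max(range(n), key=lambda a: predicted_rewards[a])
    if strategy == "greedy":
        return greedy
    if strategy == "epsilon_greedy":
        # Explore uniformly with probability epsilon, otherwise exploit.
        return rng.randrange(n) if rng.random() < epsilon else greedy
    if strategy == "softmax":
        # Sample proportionally to exp(reward / temperature);
        # subtracting the max keeps the exponentials numerically stable.
        m = max(predicted_rewards)
        weights = [math.exp((r - m) / temperature) for r in predicted_rewards]
        return rng.choices(range(n), weights=weights, k=1)[0]
    raise ValueError(f"unknown strategy: {strategy}")


def cumulative_regret(true_means: Sequence[float],
                      chosen_actions: List[int]) -> float:
    """Sum of gaps between the best mean reward and each chosen action's mean."""
    best = max(true_means)
    return sum(best - true_means[a] for a in chosen_actions)
```

For example, `select_action([0.1, 0.9, 0.3])` returns the index of the largest predicted reward under the greedy rule, while the other two rules trade some immediate reward for exploration.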
