What can an agent learn in a stochastic Multi-Armed Bandit (MAB) problem from a dataset that contains just a single sample for each arm? Surprisingly, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one. This paves the way to reliable decision-making in settings where critical decisions must be made from only a handful of samples. Our analysis reveals that \emph{stochastic policies can be substantially better} than deterministic ones for offline decision-making. Focusing on offline multi-armed bandits, we design an algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST), which differs markedly from the predominant value-based lower confidence bound (LCB) approach. Its design is enabled by localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems while being substantially lower on problems with very few samples. Finally, we consider an application to offline reinforcement learning in the special case where the logging policies are known.
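
For concreteness, one common form of the value-based lower confidence bound rule mentioned above can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the text: $K$ arms, $N_a$ logged samples of arm $a$, empirical mean $\hat{\mu}_a$, confidence level $\delta \in (0,1)$, and a Hoeffding-style width with an unspecified constant $c > 0$.
\[
  \hat{a}_{\mathrm{LCB}} \in \arg\max_{a \in \{1,\dots,K\}} \left( \hat{\mu}_a - c \sqrt{\frac{\log(K/\delta)}{N_a}} \right).
\]
Under this assumed form, if every arm has exactly one sample ($N_a = 1$ for all $a$), the penalty term is identical across arms, so the rule reduces to selecting the arm with the largest single observed reward. This suggests why a deterministic value-based choice may carry little information in the single-sample regime.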

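Given a dataset with only a single sample per arm, this work shows that it may still be possible to find a policy competitive with the optimal one, which opens the way to reliable decision-making when only a handful of samples are available. Our analysis reveals that, in offline decision-making, stochastic policies can be substantially better than deterministic ones. For offline multi-armed bandits, we design an algorithm named TRUST that differs markedly from the predominant value-based lower confidence bound approach; its design relies on localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems and substantially lower on problems with very few samples. Finally, we consider an application to offline reinforcement learning in the special case where the logging policies are known.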

Reliable Decision-Making in Data-Starved Settings via Trust Region Enhancement