We consider the development of adaptive, instance-dependent algorithms for interactive decision making (bandits, reinforcement learning, and beyond) that, rather than only performing well in the worst case, adapt to favorable properties of real-world instances for improved performance. We aim for instance-optimality, a strong notion of adaptivity which asserts that, on any particular problem instance, the algorithm under consideration outperforms all consistent algorithms. Instance-optimality enjoys a rich asymptotic theory originating from the work of \citet{lai1985asymptotically,graves1997asymptotically}, but non-asymptotic guarantees have remained elusive outside of certain special cases. Even for problems as simple as tabular reinforcement learning, existing algorithms do not attain instance-optimal performance until the number of rounds of interaction is doubly exponential in the number of states. In this paper, we take the first step toward developing a non-asymptotic theory of instance-optimal decision making with general function approximation. We introduce a new complexity measure, the Allocation-Estimation Coefficient (AEC), and provide a new algorithm, $\mathsf{AE}^2$, which attains non-asymptotic instance-optimal performance at a rate controlled by the AEC. Our results recover the best known guarantees for well-studied problems such as finite-armed and linear bandits and, when specialized to tabular reinforcement learning, attain the first instance-optimal regret bounds with polynomial dependence on all problem parameters, improving over prior work exponentially. We complement these results with lower bounds that show that i) existing notions of statistical complexity are insufficient to derive non-asymptotic guarantees, and ii) under certain technical conditions, boundedness of the AEC is necessary to learn an instance-optimal allocation of decisions in finite time.

本研究旨在开发适应性算法，用于互动决策制定，并在实际状况好的实例中适应性地提高性能。研究提出了Allocation-Estimation Coefficient (AEC)的复杂度度量，并提出了新的算法AE2，它控制了AEC的速率，达到非渐近实例优化性能。这些结果补充了较差保证，即现有的统计复杂度概念不足以推导非渐近保证，并且在某些技术条件下，有限时间内学习实例优化决策分配需要限定AEC。

交互决策中的实例最优性：走向一个非渐近理论