Motivated by the recent discovery of a statistical and computational
reduction from contextual bandits to offline regression (Simchi-Levi and Xu,
2021), we address the general (stochastic) Contextual Markov Decision Process
(CMDP) problem with horizon H (also known as a CMDP with H layers). In this paper,
we introduce a reduction from CMDPs to offline density estimation under the
realizability assumption, i.e., a model class M containing the true underlying
CMDP is provided in advance. We develop an efficient, statistically
near-optimal algorithm requiring only O(H log T) calls to an offline density
estimation algorithm (or oracle) across all T rounds of interaction. This
number can be further reduced to O(H log log T) if T is known in advance. Our
results mark the first efficient and near-optimal reduction from CMDPs to
offline density estimation without imposing any structural assumptions on the
model class. A notable feature of our algorithm is the design of a layerwise
exploration-exploitation tradeoff tailored to address the layerwise structure
of CMDPs. Additionally, our algorithm is versatile and applicable to pure
exploration tasks in reward-free reinforcement learning.
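
The oracle-call counts above can be illustrated with a small sketch. It assumes epoch schedules in the style of those used for contextual bandits (doubling epochs when T is unknown, and epochs ending at floor(2 * T^(1 - 2^-m)) when T is known) and one oracle call per layer per epoch. Both the schedules and the helper names are assumptions made for illustration, not the paper's stated algorithm.

```python
import math

def doubling_epoch_ends(T):
    """Epoch ends 2, 4, 8, ..., capped at T; usable when T is not known up front."""
    ends, t = [], 2
    while t < T:
        ends.append(t)
        t *= 2
    ends.append(T)
    return ends

def known_horizon_epoch_ends(T):
    """Epoch m ends at floor(2 * T**(1 - 2**-m)), capped at T; needs T in advance."""
    ends, m = [], 1
    while True:
        end = min(T, math.floor(2 * T ** (1 - 2.0 ** -m)))
        if not ends or end > ends[-1]:
            ends.append(end)
        if end >= T:
            return ends
        m += 1

def oracle_calls(H, epoch_ends):
    # Assumption: one density-estimation oracle call per layer per epoch.
    return H * len(epoch_ends)

H, T = 5, 65536
print(oracle_calls(H, doubling_epoch_ends(T)))       # grows like H * log2(T)
print(oracle_calls(H, known_horizon_epoch_ends(T)))  # grows like H * log2(log2(T))
```

Under these assumptions the oracle is refit only at epoch boundaries, so the number of calls depends on the number of epochs rather than on the number of rounds T.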

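This paper proposes an efficient, near-optimal reduction from contextual Markov decision processes to offline density estimation, and handles CMDPs whose model class carries no structural assumptions.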

Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff

In the fast, tactically dynamic play of turn-based sports, badminton
stands out as a paradigm that requires alter-dependent decision-making
from players. While learning from offline expert data has advanced
sequential decision-making in various domains, how to imitate the behaviors
of human players rally by rally from offline badminton matches has remained
underexplored. Replicating opponents' behavior benefits players by letting
them develop targeted strategies before matches. However, directly applying
existing methods suffers from the inherent hierarchy of the match and the
compounding effect due to the turn-based nature of players alternately
taking actions. In this paper, we
propose RallyNet, a novel hierarchical offline imitation learning model for
badminton player behaviors: (i) RallyNet captures players' decision
dependencies by modeling decision-making processes as a contextual Markov
decision process. (ii) RallyNet leverages the experience to generate context as
the agent's intent in the rally. (iii) To generate more realistic behavior,
RallyNet leverages Geometric Brownian Motion (GBM) to model the interactions
between players by introducing a valuable inductive bias for learning player
behaviors. In this manner, RallyNet uses GBM to link player intents with
interaction models, providing an understanding of interactions for sports
analytics. We extensively validate RallyNet with the largest available
real-world badminton dataset consisting of men's and women's singles,
demonstrating its ability to imitate player behaviors. Results reveal
RallyNet's superiority over offline imitation learning methods and
state-of-the-art turn-based approaches, outperforming them by at least 16% in
mean rule-based agent normalization score. Furthermore, we discuss various
practical use cases to highlight RallyNet's applicability.
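
For readers unfamiliar with Geometric Brownian Motion, the following minimal sketch simulates one GBM path using the standard exact discretization. It illustrates only the stochastic process itself; how RallyNet parameterizes GBM (what the state represents, how drift and volatility are obtained) is not stated in the abstract, so the function name and all parameter values here are hypothetical.

```python
import math
import random

def simulate_gbm(s0, mu, sigma, dt, n_steps, rng):
    """Sample a GBM path with drift mu and volatility sigma.

    Uses S_{t+dt} = S_t * exp((mu - sigma^2 / 2) * dt + sigma * sqrt(dt) * Z),
    where Z is a standard normal draw.
    """
    path = [s0]
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        step = (mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z
        path.append(path[-1] * math.exp(step))
    return path

# Hypothetical parameters, chosen only to produce a sample path.
rng = random.Random(0)
path = simulate_gbm(s0=1.0, mu=0.1, sigma=0.3, dt=0.01, n_steps=100, rng=rng)
print(len(path), min(path) > 0)
```

Because each step multiplies the previous value by a positive factor, a path that starts positive stays positive, and setting sigma to zero recovers deterministic exponential growth.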

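The paper proposes RallyNet, a new hierarchical offline imitation learning model for imitating badminton player behavior. It captures decision dependencies and introduces Geometric Brownian Motion (GBM) to model interactions between players, providing an understanding of interaction models for sports analytics. Validation shows that RallyNet outperforms offline imitation learning methods and existing turn-based approaches at imitating player behavior, with a mean rule-based agent normalization score at least 16% higher, and the paper discusses various practical use cases for RallyNet.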

Offline Imitation of Badminton Player Behavior via Experiential Contexts and Brownian Motion

We consider a planning problem where the dynamics and rewards of the
environment depend on a hidden static parameter referred to as the context. The
objective is to learn a strategy that maximizes the accumulated reward across
all contexts. The new model, called Contextual Markov Decision Process (CMDP),
can model a customer's behavior when interacting with a website (the learner).
The customer's behavior depends on gender, age, location, device, etc. Based on
that behavior, the website's objective is to determine customer characteristics,
and to optimize the interaction between them. Our work focuses on one basic
scenario--finite horizon with a small known number of possible contexts. We
suggest a family of algorithms with provable guarantees that learn the
underlying models and the latent contexts, and optimize the CMDPs. Bounds are
obtained for specific naive implementations, and extensions of the framework
are discussed, laying the ground for future research.
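
As a rough illustration of the finite-horizon setting with a small number of contexts, the sketch below stores one MDP per context and plans in each by backward induction. It assumes the per-context models are already known and the context is observed, so it covers only the planning step, not the learning of models or latent contexts that the abstract describes. The context names, rewards, and transitions are invented for the example.

```python
def backward_induction(P, R, H):
    """Finite-horizon planning for a single context.

    P[s][a] is a list of (next_state, probability) pairs; R[s][a] is a reward.
    Returns (V, policy): V[s] is the optimal H-step value from state s, and
    policy[h][s] is the optimal action at step h (0-indexed) in state s.
    """
    n_states = len(P)
    V = [0.0] * n_states
    policy = []
    for _ in range(H):  # sweep from the last step back to the first
        q = [[R[s][a] + sum(p * V[s2] for s2, p in P[s][a])
              for a in range(len(P[s]))] for s in range(n_states)]
        policy.append([max(range(len(qs)), key=qs.__getitem__) for qs in q])
        V = [max(qs) for qs in q]
    policy.reverse()  # policies were collected last step first
    return V, policy

# Two states, two actions; action 1 in state 0 moves to state 1, all else stays.
P = [[[(0, 1.0)], [(1, 1.0)]],
     [[(1, 1.0)], [(1, 1.0)]]]
# Hypothetical contexts that share dynamics but differ in reward.
cmdp = {
    "context_a": (P, [[1.0, 0.0], [0.0, 2.0]]),
    "context_b": (P, [[1.0, 0.0], [0.0, 0.5]]),
}
plans = {c: backward_induction(p, r, H=3) for c, (p, r) in cmdp.items()}
for c, (V, policy) in plans.items():
    print(c, V, policy[0])
```

In this toy instance the best first action from state 0 differs between the two contexts, which is why identifying the context matters for the learner.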

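The paper discusses a new model called the CMDP, which can model a customer's behavior when interacting with a website, determine customer characteristics from that behavior, and optimize the interaction. The authors propose a family of algorithms that learn the underlying models and latent contexts and optimize the CMDPs.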