Without direct interaction with the environment. Often, the environment in
which the data are collected differs from the environment in which the learned
policy is applied. To account for the effect of different environments during
learning and execution, distributionally robust optimization (DRO) methods have
been developed that compute worst-case bounds on the policy values assuming
that the distribution of the new environment lies within an uncertainty set.
Typically, this uncertainty set is defined based on the KL divergence around
the empirical distribution computed from the logging dataset. However, the KL
uncertainty set fails to encompass distributions with varying support and lacks
awareness of the geometry of the distribution support. As a result, KL
approaches fall short in addressing practical environment mismatches and lead
to over-fitting to worst-case scenarios. To overcome these limitations, we
propose a novel DRO approach that employs the Wasserstein distance instead.
Although Wasserstein DRO is generally more computationally expensive than
KL DRO, we present a regularized method and a practical (biased) stochastic
gradient descent method to optimize the policy efficiently. We also provide a
theoretical analysis of the finite sample complexity and iteration complexity
for our proposed method. We further validate our approach using a public
dataset that was recorded in a randomized stroke trial.

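We propose a distributionally robust optimization method that uses the Wasserstein distance to address environment mismatch, and provide theoretical analysis and empirical validation.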
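
The support and geometry limitations of the KL uncertainty set can be seen in a small numerical sketch. This is an illustration only, not the method proposed in the paper: the grid, the three toy distributions, and the helper names `kl_divergence` and `wasserstein_1d` are assumptions made up for this example. It compares a logged distribution with two shifted copies whose supports are not contained in the logged support.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) on a shared discrete grid; infinite if p has mass where q has none."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def wasserstein_1d(p, q, grid):
    """1-Wasserstein distance on a sorted 1-D grid: sum of |CDF_p - CDF_q| times grid spacing."""
    grid = np.asarray(grid, dtype=float)
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(grid)))


# Hypothetical toy data: outcomes 0..4 and three distributions over them.
grid = np.arange(5.0)
logged = np.array([0.5, 0.5, 0.0, 0.0, 0.0])  # empirical (logging) distribution
near = np.array([0.0, 0.5, 0.5, 0.0, 0.0])    # logged distribution shifted by 1
far = np.array([0.0, 0.0, 0.0, 0.5, 0.5])     # logged distribution shifted by 3

kl_near = kl_divergence(near, logged)
kl_far = kl_divergence(far, logged)
w_near = wasserstein_1d(logged, near, grid)
w_far = wasserstein_1d(logged, far, grid)

print("KL(near || logged) =", kl_near, " KL(far || logged) =", kl_far)
print("W1(logged, near)   =", w_near, " W1(logged, far)   =", w_far)
```

Under these assumptions, both shifted distributions have infinite KL divergence from the logged distribution, so no finite KL radius includes either of them and KL cannot tell the small shift from the large one. The Wasserstein distance is finite in both cases and grows with the size of the shift, which is the geometric awareness the abstract refers to.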