We consider the problem of learning personalized decision policies from
observational data from heterogeneous data sources. Moreover, we examine this
problem in the federated setting where a central server aims to learn a policy
on the data distributed across the heterogeneous sources without exchanging
their raw data. We present a federated policy learning algorithm based on
aggregation of local policies trained with doubly robust offline policy
evaluation and learning strategies. We provide a novel regret analysis for our
approach that establishes a finite-sample upper bound on a notion of global
regret across a distribution of clients. In addition, for any individual
client, we establish a corresponding local regret upper bound characterized by
the presence of distribution shift relative to all other clients. We support
our theoretical findings with experimental results. Our analysis and
experiments provide insights into the value of heterogeneous client
participation in federation for policy learning in heterogeneous settings.
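
As a minimal sketch of the doubly robust evaluation step mentioned above, the following Python code computes doubly robust scores on a single client and uses them to estimate the value of a candidate policy. The function names, the array layout, and the assumption that propensity and outcome-model estimates are already available as arrays are illustrative choices, not details taken from the paper. The aggregation of local policies across clients is not shown.

```python
import numpy as np

def doubly_robust_scores(actions, rewards, propensities, mu_hat):
    """Return an (n, k) matrix of doubly robust scores (hypothetical helper).

    actions:      (n,) int array of logged actions in {0, ..., k-1}
    rewards:      (n,) float array of observed outcomes
    propensities: (n, k) array, estimated P(A = a | X = x_i)
    mu_hat:       (n, k) array, estimated E[Y | X = x_i, A = a]
    """
    rows = np.arange(len(actions))
    # Start from the outcome-model prediction for every action.
    scores = mu_hat.astype(float)
    # Correct the logged action's entry with the inverse-propensity-weighted residual.
    residual = rewards - mu_hat[rows, actions]
    scores[rows, actions] += residual / propensities[rows, actions]
    return scores

def policy_value(scores, policy_actions):
    """Average the score of the action that the candidate policy would choose."""
    rows = np.arange(len(policy_actions))
    return float(scores[rows, policy_actions].mean())
```

With an all-zero outcome model the estimate reduces to plain inverse propensity weighting, and with a perfect outcome model the residual correction vanishes.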

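This paper proposes a federated policy learning algorithm based on aggregating local policies trained with doubly robust offline policy evaluation and learning strategies. Given observational data from heterogeneous sources, a central server learns a decision policy over the data distributed across those sources without exchanging raw data.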

Federated Offline Policy Learning with Heterogeneous Observational Data

We study the problem of learning personalized decision policies from
observational data while accounting for possible unobserved confounding.
Previous approaches, which assume unconfoundedness, i.e., that no unobserved
confounders affect both the treatment assignment and the outcome, can lead
to policies that introduce harm rather than benefit when some unobserved
confounding is present, as is generally the case with observational data.
Instead, since policy value and regret may not be point-identifiable, we study
a method that minimizes the worst-case estimated regret of a candidate policy
against a baseline policy over an uncertainty set for propensity weights that
controls the extent of unobserved confounding. We prove generalization
guarantees that ensure our policy will be safe when applied in practice and
will in fact obtain the best-possible uniform control on the range of all
possible population regrets that agree with the possible extent of confounding.
We develop efficient algorithmic solutions to compute this confounding-robust
policy. Finally, we assess and compare our methods on synthetic and
semi-synthetic data. In particular, we consider a case study on personalizing
hormone replacement therapy based on observational data, where we validate our
results on a randomized experiment. We demonstrate that hidden confounding can
hinder existing policy learning approaches and lead to unwarranted harm, while
our robust approach guarantees safety and focuses on well-evidenced
improvement, a necessity for making personalized treatment policies learned
from observational data reliable in practice.
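
To illustrate the worst-case step described above, here is a minimal Python sketch. It assumes one common form of uncertainty set: each inverse-propensity weight lies in an interval determined by the nominal propensity and a sensitivity parameter `gamma >= 1` that bounds the odds-ratio effect of unobserved confounding. The interval formula, the function names, and the normalized (self-weighted) form of the estimate are assumptions made for illustration, not details stated in the abstract. Here `values` stands for hypothetical per-sample regret contributions of a candidate policy relative to the baseline.

```python
import numpy as np

def weight_bounds(propensity_logged, gamma):
    """Interval for each inverse-propensity weight under an assumed odds-ratio bound."""
    inv = 1.0 / propensity_logged
    lower = 1.0 + (inv - 1.0) / gamma
    upper = 1.0 + (inv - 1.0) * gamma
    return lower, upper

def worst_case_weighted_mean(values, lower, upper):
    """Maximize sum(w * v) / sum(w) over lower <= w <= upper.

    The maximizer puts the lower bound on small values and the upper bound
    on large values, so it suffices to scan thresholds in sorted order.
    """
    order = np.argsort(values)
    v, lo, hi = values[order], lower[order], upper[order]
    best = -np.inf
    for k in range(len(v) + 1):
        # The first k sorted points get lower weights, the rest upper weights.
        w = np.concatenate([lo[:k], hi[k:]])
        best = max(best, float(np.dot(w, v) / w.sum()))
    return best
```

Setting `gamma = 1` collapses every interval to the nominal weight and recovers the usual normalized estimate. Larger values widen the intervals, so the worst-case regret estimate can only grow. The threshold scan is quadratic in the sample size and is kept for clarity rather than speed.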

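We study how to account for possible unobserved confounding when learning personalized decision policies from observational data, together with methods and algorithms that minimize the worst-case estimated regret of a candidate policy, so as to obtain reliable personalized treatment policies that guarantee safety and focus on well-evidenced improvement.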

Confounding-Robust Policy Improvement

We present a new approach to the problems of evaluating and learning
personalized decision policies from observational data of past contexts,
decisions, and outcomes. Only the outcome of the enacted decision is available
and the historical policy is unknown. These problems arise in personalized
medicine using electronic health records and in internet advertising. Existing
approaches use inverse propensity weighting (or doubly robust versions) to
make historical outcome (or residual) data look as if it were generated by a
new policy being evaluated or learned. But this relies on a plug-in approach
that rejects data points with a decision that disagrees with the new policy,
leading to high variance estimates and ineffective learning. We propose a new,
balance-based approach that also makes the data look as if it were generated by the new policy but
does so directly by finding weights that optimize for balance between the
weighted data and the target policy in the given, finite sample, which is
equivalent to minimizing worst-case or posterior conditional mean square error.
Our policy learner proceeds as a two-level optimization problem over policies
and weights. We demonstrate that this approach markedly outperforms existing
ones both in evaluation and learning, which is unsurprising given the wider
support of balance-based weights. We establish extensive theoretical
consistency guarantees and regret bounds that support this empirical success.
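
The variance problem attributed to inverse propensity weighting above can be made concrete with a small Python sketch. It shows only the standard weights, which are zero wherever the logged decision disagrees with the new policy, together with an effective-sample-size diagnostic. It does not implement the balance-based weights proposed here. The function names and the choice of diagnostic are illustrative assumptions.

```python
import numpy as np

def ipw_weights(actions, policy_actions, propensity_logged):
    """Inverse propensity weights for evaluating a deterministic policy.

    A sample keeps weight 1 / propensity only if the logged action matches
    the action the new policy would take; every other sample gets weight 0.
    """
    agree = (actions == policy_actions).astype(float)
    return agree / propensity_logged

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = weights.sum()
    return float(total * total / np.dot(weights, weights)) if total > 0 else 0.0
```

A low effective sample size relative to the number of logged samples indicates that few points carry the estimate, which is the high-variance regime described in the text.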

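We propose a balance-based weighting method for evaluating and learning personalized decision policies. The method applies to personalized medicine and internet advertising based on historical records, and it markedly outperforms existing approaches.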