In modern recommendation systems, unbiased learning-to-rank (LTR) is crucial
for prioritizing items from biased implicit user feedback, such as click data.
Several techniques, such as Inverse Propensity Weighting (IPW), have been
proposed for single-sided markets. However, less attention has been paid to
two-sided markets, such as job platforms or dating services, where successful
conversions require matching preferences from both sides. This paper addresses
the complex interaction of biases between users in two-sided markets and
proposes a tailored LTR approach. We first present a formulation of feedback
mechanisms in two-sided matching platforms and point out that their implicit
feedback may include position bias from both user groups. On the basis of this
observation, we extend the IPW estimator and propose a new estimator, named
two-sided IPW, to address the position biases in two-sided markets. We prove
that the proposed estimator is unbiased for the ground-truth
ranking metric. We conducted numerical experiments on real-world two-sided
platforms and demonstrated the effectiveness of our proposed method in terms of
both precision and robustness. Our experiments showed that our method
outperformed baselines especially when handling rare items, which are less
frequently observed in the training data.
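
The idea of correcting for position bias on both sides can be illustrated with a minimal sketch. It assumes a simple position-based model in which a match is logged only when both sides independently examine each other and are mutually relevant, so each observed match is weighted by the inverse of the product of the two examination propensities. The function names, the independence assumption, and the toy numbers are illustrative assumptions, not the estimator as defined in the paper.

```python
from itertools import product

def two_sided_ipw(matches, prop_a, prop_b, gains):
    """Weight each observed match by the inverse of BOTH examination propensities."""
    return sum(g * m / (pa * pb)
               for m, pa, pb, g in zip(matches, prop_a, prop_b, gains))

def expected_estimate(relevance, prop_a, prop_b, gains):
    """Exact expectation of the estimate over independent examination events."""
    total = 0.0
    for r, pa, pb, g in zip(relevance, prop_a, prop_b, gains):
        for seen_a, seen_b in product((0, 1), repeat=2):
            prob = (pa if seen_a else 1 - pa) * (pb if seen_b else 1 - pb)
            match = seen_a * seen_b * r  # a match needs both sides to examine
            total += prob * two_sided_ipw([match], [pa], [pb], [g])
    return total

# Hypothetical toy data: mutual relevance, per-side propensities, ranking gains.
relevance = [1, 0, 1]
prop_a = [0.9, 0.5, 0.2]
prop_b = [0.8, 0.4, 0.5]
gains = [1.0, 0.5, 0.25]

# Ranking metric computed on the (normally unobserved) true relevance.
ground_truth = sum(g * r for g, r in zip(gains, relevance))
```

Under these assumptions, `expected_estimate(relevance, prop_a, prop_b, gains)` agrees with `ground_truth` up to floating-point error, which is the sense in which the weighting is unbiased. Dividing by only one side's propensity would underestimate the metric in this toy model.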

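This paper addresses the interaction of biases between users in two-sided markets and proposes a tailored unbiased learning-to-rank method. It proves that the method is unbiased for the ground-truth ranking metric and shows experimentally that it outperforms baseline methods when handling rare items.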

An IPW-based unbiased ranking metric for two-sided markets

An IPW-based Unbiased Ranking Metric in Two-sided Markets

Learning optimal policies from historical data enables personalization in a
wide variety of applications including healthcare, digital recommendations, and
online education. The growing policy learning literature focuses on settings
where the data collection rule stays fixed throughout the experiment. However,
adaptive data collection is becoming more common in practice, from two primary
sources: 1) data collected from adaptive experiments that are designed to
improve inferential efficiency; 2) data collected from production systems that
progressively evolve an operational policy to improve performance over time
(e.g., contextual bandits). Yet adaptivity complicates the optimal policy
identification ex post, since samples are dependent, and each treatment may not
receive enough observations for each type of individual. In this paper, we take
a first step toward addressing the challenges of learning the
optimal policy with adaptively collected data. We propose an algorithm based on
generalized augmented inverse propensity weighted (AIPW) estimators, which
non-uniformly reweight the elements of a standard AIPW estimator to control
worst-case estimation variance. We establish a finite-sample regret upper bound
for our algorithm and complement it with a regret lower bound that quantifies
the fundamental difficulty of policy learning with adaptive data. When equipped
with the best weighting scheme, our algorithm achieves minimax rate-optimal
regret guarantees even with diminishing exploration. Finally, we demonstrate
our algorithm's effectiveness using both synthetic data and public benchmark
datasets.
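
A weighted AIPW policy-value estimate of the kind described above can be sketched as follows. Each logged sample contributes an AIPW score (an outcome-model prediction plus an inverse-propensity-corrected residual), and the scores are averaged with non-uniform per-sample weights. The record layout, the function names, and the example weights are hypothetical. How the weights should be chosen to control worst-case variance is the subject of the paper and is not reproduced here.

```python
def aipw_score(arm, chosen_arm, reward, propensity, mu_hat):
    """AIPW score for `arm`: model prediction plus an IPW-corrected residual."""
    correction = (reward - mu_hat) / propensity if chosen_arm == arm else 0.0
    return mu_hat + correction

def weighted_aipw_value(policy, logs, weights):
    """Weighted average of AIPW scores for the arms that `policy` would pick."""
    numerator = 0.0
    for h, rec in zip(weights, logs):
        arm = policy(rec["context"])
        numerator += h * aipw_score(
            arm,
            rec["arm"],
            rec["reward"],
            rec["propensities"][arm],  # logging probability of `arm` at that step
            rec["mu_hat"][arm],        # outcome-model prediction for `arm`
        )
    return numerator / sum(weights)

# Hypothetical log of two adaptively collected samples.
logs = [
    {"context": 0, "arm": 1, "reward": 1.0,
     "propensities": {0: 0.5, 1: 0.5}, "mu_hat": {0: 0.2, 1: 0.6}},
    {"context": 1, "arm": 0, "reward": 0.0,
     "propensities": {0: 0.9, 1: 0.1}, "mu_hat": {0: 0.1, 1: 0.7}},
]

def always_arm_one(context):
    return 1

uniform_value = weighted_aipw_value(always_arm_one, logs, [1.0, 1.0])
reweighted_value = weighted_aipw_value(always_arm_one, logs, [1.0, 0.5])
```

With uniform weights this reduces to the standard AIPW estimate. Down-weighting a sample, as in the second call, shrinks its influence on the estimate, which matters most for samples whose logging propensity was small.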

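This paper studies how to learn an optimal policy from adaptively collected data using a weighted estimation algorithm. It proposes an algorithm based on generalized augmented inverse propensity weighted (AIPW) estimators and establishes a finite-sample regret upper bound, showing that with the best weighting scheme the algorithm achieves minimax rate-optimal regret guarantees even with diminishing exploration.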

Policy learning with adaptively collected data

Policy Learning with Adaptively Collected Data

We present a new approach to the problems of evaluating and learning
personalized decision policies from observational data of past contexts,
decisions, and outcomes. Only the outcome of the enacted decision is available
and the historical policy is unknown. These problems arise in personalized
medicine using electronic health records and in internet advertising. Existing
approaches use inverse propensity weighting (or doubly robust versions) to
make historical outcome (or residual) data look as if it were generated by a
new policy being evaluated or learned. But this relies on a plug-in approach
that rejects data points with a decision that disagrees with the new policy,
leading to high-variance estimates and ineffective learning. We propose a new,
balance-based approach that also makes the data look as if it were generated by the new policy but
does so directly by finding weights that optimize for balance between the
weighted data and the target policy in the given, finite sample, which is
equivalent to minimizing worst-case or posterior conditional mean square error.
Our policy learner proceeds as a two-level optimization problem over policies
and weights. We demonstrate that this approach markedly outperforms existing
ones both in evaluation and learning, which is unsurprising given the wider
support of balance-based weights. We establish extensive theoretical
consistency guarantees and regret bounds that support this empirical success.
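
The data-rejection problem attributed to plug-in inverse propensity weighting above can be made concrete with a small sketch. It assumes estimated propensities for the decisions actually taken are already available, and it uses the Kish effective sample size as a rough variance diagnostic. Both the helper names and the toy numbers are illustrative assumptions. The balance-based weights proposed in the paper come from an optimization problem that is not shown here.

```python
def ipw_weights(decisions, policy_decisions, est_propensities):
    """Plug-in IPW weights: zero wherever the logged decision disagrees
    with the decision the new policy would make."""
    return [
        1.0 / p if d == target else 0.0
        for d, target, p in zip(decisions, policy_decisions, est_propensities)
    ]

def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq if total_sq > 0 else 0.0

# Hypothetical logged decisions, new-policy decisions, and estimated
# propensities of the logged decisions.
decisions = [0, 1, 1, 0]
policy_decisions = [1, 1, 0, 0]
est_propensities = [0.5, 0.25, 0.5, 0.5]

weights = ipw_weights(decisions, policy_decisions, est_propensities)
ess = effective_sample_size(weights)
```

In this toy log, half of the samples receive zero weight and the remaining weights are uneven, so the effective sample size falls well below the four logged samples.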

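This paper proposes a balance-based weighting method for evaluating and learning personalized decision policies. The method applies to personalized medicine and internet advertising that draw on historical records, and it markedly outperforms existing methods.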