Increasing users' positive interactions, such as purchases or clicks, is an
important objective of recommender systems. Recommenders typically aim to
select items that users will interact with. If the recommended items are
purchased, an increase in sales is expected. However, the items could have been
purchased even without recommendation. Thus, we want to recommend items that
result in purchases caused by the recommendation. This can be formulated as a
ranking problem in terms of the causal effect. Despite its importance, this
problem has not been well explored in the related research. It is challenging
because the ground truth of the causal effect is unobservable, and estimating the
causal effect is prone to the bias arising from currently deployed
recommenders. This paper proposes an unbiased learning framework for the causal
effect of recommendation. Based on the inverse propensity scoring technique,
the proposed framework first constructs unbiased estimators for ranking
metrics. Then, it conducts empirical risk minimization on the estimators with
propensity capping, which reduces variance under finite training samples. Based
on the framework, we develop an unbiased learning method for the causal effect
extension of a ranking metric. We theoretically analyze the unbiasedness of the
proposed method and empirically demonstrate that the proposed method
outperforms other biased learning methods in various settings.
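
The inverse propensity scoring step with propensity capping can be illustrated with a minimal sketch. It assumes a simple per-interaction estimator in which outcomes of recommended items are weighted by the inverse propensity and outcomes of non-recommended items are subtracted with the inverse of the complementary propensity. This form, the function name `capped_ips_effect`, the log layout, and the default cap are assumptions made for illustration, not necessarily the ranking-metric estimators constructed in the paper.

```python
def capped_ips_effect(logs, cap=0.1):
    """Estimate the average causal effect of recommendation from logged data.

    logs: list of (recommended, outcome, propensity) tuples, where
      recommended is 1 if the item was recommended and 0 otherwise,
      outcome is 1 if the user interacted and 0 otherwise,
      propensity is the logging recommender's probability of recommending.
    cap: lower bound applied to every propensity (hypothetical default).
    """
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for recommended, outcome, propensity in logs:
        if recommended:
            # Outcome observed under recommendation: weight by 1 / P(recommended).
            total += outcome / max(propensity, cap)
        else:
            # Outcome observed without recommendation: weight by 1 / P(not recommended).
            total -= outcome / max(1.0 - propensity, cap)
    return total / len(logs)


# Toy log: the third record has a very small propensity (0.05).
toy_logs = [(1, 1, 0.5), (0, 1, 0.5), (1, 1, 0.05), (0, 0, 0.9)]
print(capped_ips_effect(toy_logs, cap=0.1))
print(capped_ips_effect(toy_logs, cap=0.01))
```

With the larger cap, the record with propensity 0.05 contributes a weight of 10 instead of 20, so a single rare observation dominates the estimate less. In general, capping trades a small amount of bias for lower variance.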

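This paper proposes an unbiased learning framework, based on the inverse propensity scoring technique, for the causal effect of recommendation. The framework constructs unbiased estimators for ranking metrics and conducts empirical risk minimization on them, reducing variance under finite training samples. Based on the framework, an unbiased learning method is developed for the causal effect extension of a ranking metric, and it outperforms biased learning methods in various settings.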

Unbiased Learning for the Causal Effect of Recommendation

In most real-world recommender systems, the observed rating data are subject
to selection bias, and the data are thus missing-not-at-random. Developing a
method to facilitate the learning of a recommender with biased feedback is one
of the most challenging problems, as it is widely known that naive approaches
under selection bias often lead to suboptimal results. A well-established
solution for the problem is using propensity scoring techniques. The propensity
score is the probability that each data point is observed, and unbiased performance
estimation is possible by weighting each data point by the inverse of its propensity.
However, the performance of the propensity-based unbiased estimation approach
is often affected by the choice of the propensity estimation model or by the
high variance problem. To overcome these limitations, we propose a model-agnostic
meta-learning method inspired by the asymmetric tri-training framework for
unsupervised domain adaptation. The proposed method utilizes two predictors to
generate data with reliable pseudo-ratings and another predictor to make the
final predictions. In a theoretical analysis, a propensity-independent upper
bound of the true performance metric is derived, and it is demonstrated that
the proposed method can minimize this bound. We conduct comprehensive
experiments using public real-world datasets. The results suggest that the
previous propensity-based methods are largely affected by the choice of
propensity models and the variance problem caused by the inverse propensity
weighting. Moreover, we show that the proposed meta-learning method is robust
to these issues and can facilitate the development of effective recommendations
from biased explicit feedback.
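
The pseudo-rating step described above can be sketched as follows. The sketch assumes that a pseudo-rating counts as reliable when the two predictors agree within a tolerance, and that the pseudo-rating is the mean of the two predictions. The agreement rule, the tolerance `epsilon`, the averaging, and the function name `generate_pseudo_ratings` are illustrative assumptions rather than details stated in the abstract.

```python
def generate_pseudo_ratings(pairs, predictor_a, predictor_b, epsilon=0.5):
    """Build a pseudo-labeled dataset from two already-trained predictors.

    pairs: iterable of (user, item) pairs without observed ratings.
    predictor_a, predictor_b: callables mapping (user, item) to a rating.
    epsilon: maximum disagreement for a pseudo-rating to be kept (assumed).
    """
    pseudo = []
    for user, item in pairs:
        rating_a = predictor_a(user, item)
        rating_b = predictor_b(user, item)
        # Keep the pair only when both predictors roughly agree.
        if abs(rating_a - rating_b) <= epsilon:
            # Assumption: use the mean of the two predictions as the pseudo-rating.
            pseudo.append((user, item, (rating_a + rating_b) / 2.0))
    return pseudo


# Toy predictors backed by lookup tables (hypothetical values).
table_a = {("u1", "i1"): 4.0, ("u1", "i2"): 2.0}
table_b = {("u1", "i1"): 4.4, ("u1", "i2"): 3.5}
pseudo_data = generate_pseudo_ratings(
    [("u1", "i1"), ("u1", "i2")],
    lambda u, i: table_a[(u, i)],
    lambda u, i: table_b[(u, i)],
)
print(pseudo_data)  # only ("u1", "i1") passes the agreement check
```

A third predictor would then be trained on `pseudo_data` to make the final predictions. No propensity weights appear in this step.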

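Proposes a meta-learning method, inspired by the asymmetric tri-training framework, that uses two predictors to generate data with reliable pseudo-ratings and another predictor to make the final predictions. It addresses selection bias in the rating data observed by recommender systems and enables effective recommendations to be developed from biased explicit feedback.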

Asymmetric Tri-training for Debiasing Missing-Not-At-Random Explicit Feedback

We develop a learning principle and an efficient algorithm for batch learning
from logged bandit feedback. This learning setting is ubiquitous in online
systems (e.g., ad placement, web search, recommendation), where an algorithm
makes a prediction (e.g., ad ranking) for a given input (e.g., query) and
observes bandit feedback (e.g., user clicks on presented ads). We first address
the counterfactual nature of the learning problem through propensity scoring.
Next, we prove generalization error bounds that account for the variance of the
propensity-weighted empirical risk estimator. These constructive bounds give
rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM
can be used to derive a new learning method -- called Policy Optimizer for
Exponential Models (POEM) -- for learning stochastic linear rules for
structured output prediction. We present a decomposition of the POEM objective
that enables efficient stochastic gradient optimization. POEM is evaluated on
several multi-label classification problems, showing substantially improved
robustness and generalization performance compared to the state of the art.
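
The idea of accounting for the variance of the propensity-weighted risk estimator can be illustrated with a small sketch: a clipped inverse-propensity risk estimate plus a penalty proportional to the square root of its sample variance divided by the number of samples. The function name `crm_objective`, the clipping constant, and the penalty weight `lam` are hypothetical choices for illustration. The sketch only evaluates the objective for fixed policy probabilities and does not implement POEM's optimization.

```python
import math


def crm_objective(losses, new_probs, logged_probs, clip=10.0, lam=0.5):
    """Variance-penalized, propensity-weighted risk for a candidate policy.

    losses: observed loss for each logged action (bandit feedback).
    new_probs: probability the candidate policy assigns to each logged action.
    logged_probs: propensity of each logged action under the logging policy.
    clip: upper bound on the importance weight (assumed value).
    lam: weight of the variance penalty (assumed value).
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two logged samples")
    # Clipped importance-weighted loss for every logged sample.
    weighted = [
        loss * min(clip, q / p)
        for loss, q, p in zip(losses, new_probs, logged_probs)
    ]
    mean = sum(weighted) / n
    # Unbiased sample variance of the weighted losses.
    variance = sum((w - mean) ** 2 for w in weighted) / (n - 1)
    return mean + lam * math.sqrt(variance / n)


losses = [1.0, 0.0, 1.0, 0.0]
new_probs = [0.2, 0.8, 0.9, 0.1]
logged_probs = [0.5, 0.5, 0.1, 0.5]
print(crm_objective(losses, new_probs, logged_probs, clip=2.0, lam=0.0))
print(crm_objective(losses, new_probs, logged_probs, clip=2.0, lam=0.5))
```

Setting `lam=0.0` recovers the plain clipped propensity-weighted risk. A positive `lam` favors policies whose weighted losses vary less across the log.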

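Develops a learning principle and an efficient algorithm for batch learning from logged bandit feedback. The resulting Counterfactual Risk Minimization (CRM) principle yields a new learning method, POEM, for learning stochastic linear rules for structured output prediction.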