It has become increasingly common for data to be collected adaptively, for
example using contextual bandits. Historical data of this type can be used to
evaluate other treatment assignment policies to guide future innovation or
experiments. However, policy evaluation is challenging if the target policy
differs from the one used to collect data, and popular estimators, including
doubly robust (DR) estimators, can be plagued by bias, excessive variance, or
both. In particular, when the pattern of treatment assignment in the collected
data looks little like the pattern generated by the policy to be evaluated, the
importance weights used in DR estimators explode, leading to excessive
variance.
In this paper, we improve the DR estimator by adaptively weighting
observations to control its variance. We show that a t-statistic based on our
improved estimator is asymptotically normal under certain conditions, allowing
us to form confidence intervals and test hypotheses. Using synthetic data and
public benchmarks, we provide empirical evidence for our estimator's improved
accuracy and inferential properties relative to existing alternatives.
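
As a rough illustration of why the importance weights matter, the sketch below computes per-observation doubly robust scores and a weighted average of them. It is a minimal sketch under stated assumptions, not the estimator proposed in the paper: the names `dr_scores`, `weighted_estimate`, `target_policy`, and `reward_model` are hypothetical, and the weighting rule in the usage lines (square root of the logging probability) is only a placeholder for an adaptive weighting scheme.

```python
def dr_scores(contexts, actions, rewards, logging_probs, target_policy, reward_model):
    """Per-observation doubly robust scores for a target policy.

    target_policy(x)   -> list of action probabilities under the policy to evaluate
    reward_model(x, a) -> plug-in estimate of the mean reward (hypothetical helper)
    logging_probs[t]   -> probability the logging policy gave to the action taken
    """
    scores = []
    for x, a, r, e in zip(contexts, actions, rewards, logging_probs):
        pi = target_policy(x)
        # Direct-method term: model-based value of the target policy at x.
        direct = sum(p * reward_model(x, k) for k, p in enumerate(pi))
        # Importance weight: large when the logging policy rarely took action a.
        weight = pi[a] / e
        scores.append(direct + weight * (r - reward_model(x, a)))
    return scores


def weighted_estimate(scores, weights=None):
    """Normalized weighted average of the scores (uniform weights by default)."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Toy data: two actions, a target policy that always picks action 0.
contexts = [0, 0, 0]
actions = [0, 1, 0]
rewards = [1.0, 0.0, 1.0]
logging_probs = [0.5, 0.5, 0.9]
scores = dr_scores(contexts, actions, rewards, logging_probs,
                   target_policy=lambda x: [1.0, 0.0],
                   reward_model=lambda x, a: 0.5)

uniform = weighted_estimate(scores)
# Placeholder weights (an assumption, not the paper's rule): down-weight
# observations whose logging probability was small.
damped = weighted_estimate(scores, [e ** 0.5 for e in logging_probs])
```

With uniform weights, an observation collected with a very small logging probability can dominate the average. Non-uniform weights trade some of that variance against how much each observation contributes.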

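This paper improves the doubly robust estimator by adaptively weighting observations to control its variance. Using synthetic data and public benchmarks, it provides empirical evidence that the resulting estimator has better accuracy and inferential properties than existing alternatives.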

Off-policy evaluation via adaptive weighting, using data from contextual bandits

Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits

Sequence generation models are commonly refined with reinforcement learning
over user-defined metrics. However, high gradient variance hinders the
practical use of this method. To stabilize it, we adapt a policy gradient
estimator to the contextual generation of categorical sequences; the estimator
evaluates a set of correlated Monte Carlo (MC) rollouts for variance control.
Due to the correlation, the number of unique rollouts is random and adaptive to
model uncertainty; those rollouts naturally become baselines for each other,
and hence are combined to effectively reduce gradient variance. We also
demonstrate the use of correlated MC rollouts for binary-tree softmax models,
which reduce the high generation cost in large vocabulary scenarios by
decomposing each categorical action into a sequence of binary actions. We
evaluate our methods on both neural program synthesis and image captioning. The
proposed methods yield lower gradient variance and consistent improvement over
related baselines.
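
Two ideas from the abstract above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the estimator of the paper. The first function uses a plain leave-one-out baseline over a fixed number of rollouts, a simpler stand-in for the correlated rollouts described above, whose number is random. The second assumes a balanced binary tree in which each token is identified by a fixed-length bit code. All function and parameter names are hypothetical.

```python
import math


def leave_one_out_advantages(rewards):
    """Use the other rollouts' mean reward as the baseline for each rollout."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two rollouts to form a baseline")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def token_to_bits(token, depth):
    """Binary code of a token id, most significant bit first."""
    return [(token >> (depth - 1 - i)) & 1 for i in range(depth)]


def token_probability(token, depth, node_logits):
    """Probability of a token as a product of binary decisions down the tree.

    node_logits maps a bit prefix (a tuple) to the logit of taking the
    branch labeled 1 at that internal node; missing nodes default to 0.0.
    """
    prob = 1.0
    prefix = ()
    for bit in token_to_bits(token, depth):
        p_one = 1.0 / (1.0 + math.exp(-node_logits.get(prefix, 0.0)))
        prob *= p_one if bit else 1.0 - p_one
        prefix += (bit,)
    return prob


advantages = leave_one_out_advantages([1.0, 0.0, 0.5])
logits = {(): 1.0, (0,): -0.5, (1,): 2.0}
probs = [token_probability(t, 2, logits) for t in range(4)]
```

With a vocabulary of size 2^depth, sampling a token this way takes `depth` binary decisions instead of one softmax over the whole vocabulary, which is where the cost saving in large-vocabulary settings comes from.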

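This work proposes a policy gradient estimator for categorical sequence generation that evaluates a set of correlated Monte Carlo rollouts to control variance, which effectively reduces gradient variance. Combined with binary-tree softmax models, it also lowers the generation cost in large-vocabulary scenarios.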

An adaptive correlated Monte Carlo method for contextual categorical sequence generation

Adaptive Correlated Monte Carlo for Contextual Categorical Sequence Generation

Opinion summarization is the task of automatically creating summaries that
reflect subjective information expressed in multiple documents, such as product
reviews. While the majority of previous work has focused on the extractive
setting, i.e., selecting fragments from input reviews to produce a summary, we
let the model generate novel sentences and hence produce abstractive summaries.
Recent progress in summarization has seen the development of supervised models
which rely on large quantities of document-summary pairs. Since such training
data is expensive to acquire, we instead consider the unsupervised setting; in
other words, we do not use any summaries in training. We define a generative
model for a review collection which capitalizes on the intuition that when
generating a new review given a set of other reviews of a product, we should be
able to control the "amount of novelty" going into the new review or,
equivalently, vary the extent to which it deviates from the input. At test
time, when generating summaries, we force the novelty to be minimal, and
produce a text reflecting consensus opinions. We capture this intuition by
defining a hierarchical variational autoencoder model. Both individual reviews
and the products they correspond to are associated with stochastic latent
codes, and the review generator ("decoder") has direct access to the text of
input reviews through the pointer-generator mechanism. Experiments on Amazon
and Yelp datasets show that setting the review's latent code to its mean at
test time allows the model to produce fluent and coherent summaries reflecting
common opinions.
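
The test-time step of fixing the latent code at its mean can be sketched as follows. This is a minimal sketch assuming a Gaussian latent code with diagonal covariance, parameterized by a mean and a log-variance; the abstract only says the codes are stochastic, so that parameterization and the name `latent_code` are assumptions made for illustration.

```python
import math
import random


def latent_code(mean, log_var, sample, rng=None):
    """Return a latent code for a review.

    sample=True  -> draw z = mean + std * noise, as when generating a new review
    sample=False -> return the mean itself, i.e. minimal novelty at test time
    """
    if not sample:
        return list(mean)
    rng = rng or random.Random()
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


mean, log_var = [0.1, -0.2], [0.0, 0.0]
z_train = latent_code(mean, log_var, sample=True, rng=random.Random(0))
z_summary = latent_code(mean, log_var, sample=False)
```

Using the mean removes the sampled deviation, so the decoder is conditioned only on what the input reviews have in common.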

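This work proposes a generative model based on a variational autoencoder that, in the unsupervised setting, produces concise review summaries reflecting consensus opinions by controlling how far the generated text deviates from the input reviews.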