A common issue in learning decision-making policies in data-rich settings is
spurious correlations in the offline dataset, which can be caused by hidden
confounders. Instrumental variable (IV) regression, which utilises a key
unconfounded variable known as the instrument, is a standard technique for
learning causal relationships between confounded action, outcome, and context
variables. Most recent IV regression algorithms use a two-stage approach, where
a deep neural network (DNN) estimator learnt in the first stage is directly
plugged into the second stage, in which another DNN is used to estimate the
causal effect. Naively plugging in the first-stage estimator can cause heavy bias in the
second stage, especially when regularisation bias is present in the first-stage
estimator. We propose DML-IV, a non-linear IV regression method that reduces
the bias in two-stage IV regressions and effectively learns high-performing
policies. We derive a novel learning objective to reduce bias and design the
DML-IV algorithm following the double/debiased machine learning (DML)
framework. The learnt DML-IV estimator has a strong convergence rate and
$O(N^{-1/2})$ suboptimality guarantees that match those when the dataset is
unconfounded. DML-IV outperforms state-of-the-art IV regression methods on IV
regression benchmarks and learns high-performing policies in the presence of
instruments.
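
To make the two-stage idea concrete, the following is a minimal sketch of classical linear two-stage least squares on synthetic data. It is not DML-IV and uses no neural networks or debiasing; the data-generating process, the coefficients, and the names `simulate`, `ols_slope`, and `two_stage_least_squares` are all assumptions chosen for illustration. It only shows why regressing the outcome directly on a confounded action is biased, and how an instrument can recover the causal effect.

```python
import numpy as np


def simulate(n, seed=0):
    """Synthetic confounded data (hypothetical setup; true causal effect is 2.0)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                          # instrument, independent of u
    u = rng.normal(size=n)                          # hidden confounder
    a = 0.8 * z + u + 0.1 * rng.normal(size=n)      # action depends on z and u
    y = 2.0 * a - 3.0 * u + 0.1 * rng.normal(size=n)  # outcome depends on a and u
    return z, a, y


def ols_slope(x, y):
    """Slope of an ordinary least-squares fit of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


def two_stage_least_squares(z, a, y):
    """Stage 1: predict the action from the instrument.
    Stage 2: regress the outcome on the predicted action."""
    Z = np.column_stack([np.ones_like(z), z])
    coef1, *_ = np.linalg.lstsq(Z, a, rcond=None)
    a_hat = Z @ coef1                # first-stage estimate, plugged in directly
    return ols_slope(a_hat, y)


z, a, y = simulate(20000)
naive = ols_slope(a, y)                   # biased by the hidden confounder
iv = two_stage_least_squares(z, a, y)     # should be close to the true effect
print(f"naive OLS slope: {naive:.2f}, 2SLS slope: {iv:.2f}")
```

In this linear setting the plug-in first stage is harmless. The bias the abstract refers to arises when the first stage is a regularised non-linear estimator, which this sketch does not model.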

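DML-IV, an algorithm designed under the double/debiased machine learning framework, effectively reduces the bias in two-stage IV regression and learns high-performing policies.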

Learning Decision Policies with Instrumental Variables through Double Machine Learning

In order to mitigate some of the inefficiencies of Reinforcement Learning
(RL), modular approaches composing different decision-making policies to derive
agents capable of performing a variety of tasks have been proposed. The modules
at the basis of these architectures are generally reusable, also allowing for
"plug-and-play" integration. However, such solutions still lack the ability to
process and integrate multiple types of information (knowledge), such as rules,
sub-goals, and skills. We propose Augmented Modular Reinforcement Learning
(AMRL) to address these limitations. This new framework uses an arbitrator to
select heterogeneous modules and seamlessly incorporate different types of
knowledge. Additionally, we introduce a variation of the selection mechanism,
namely the Memory-Augmented Arbitrator, which adds the capability of exploiting
temporal information. We evaluate the proposed mechanisms on both established
and new environments and benchmark them against prominent deep RL
algorithms. Our results demonstrate the performance improvements that can be
achieved by augmenting traditional modular RL with other forms of heterogeneous
knowledge.
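
As a rough illustration of what an arbitrator over heterogeneous modules could look like, the sketch below mixes the action preferences of a hand-written rule and a stand-in skill policy using softmax weights. Everything here is an assumption for illustration: the module functions, the fixed `scores`, and the mixing rule are hypothetical and are not taken from the AMRL paper, where the selection mechanism would be learnt rather than fixed.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()


def rule_module(state):
    """Hypothetical rule-based module: always prefers action 0."""
    return np.array([1.0, 0.0, 0.0])


def skill_module(state):
    """Hypothetical stand-in for a learnt skill: preferences depend on the state."""
    return softmax(state[:3])


def arbitrate(state, modules, scores):
    """Mix the modules' action distributions with softmax weights.

    `scores` is fixed here for simplicity; in a learnt arbitrator it would be
    a function of the state (an assumption, not the paper's specification).
    """
    weights = softmax(scores)
    proposals = np.stack([m(state) for m in modules])  # shape: (modules, actions)
    return weights @ proposals


state = np.array([0.1, 0.5, -0.2])
mixed = arbitrate(state, [rule_module, skill_module], scores=[0.0, 1.0])
print(mixed)
```

Because each module returns a probability vector and the weights sum to one, the mixed output is itself a valid action distribution, which is what allows modules of different kinds to be combined behind one interface.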

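This work proposes the Augmented Modular Reinforcement Learning (AMRL) framework, which uses an arbitrator to select heterogeneous modules and seamlessly incorporate different types of knowledge. It also introduces a variant of the selection mechanism, the Memory-Augmented Arbitrator, which exploits temporal information. The evaluation shows that augmenting traditional modular RL with other forms of heterogeneous knowledge improves performance.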