Discovering causal relations among a set of variables is a long-standing
question in many empirical sciences. Recently, Reinforcement Learning (RL) has
achieved promising results in causal discovery from observational data.
However, searching the space of directed graphs and enforcing acyclicity by
implicit penalties tend to be inefficient and restrict the existing RL-based
method to small-scale problems. In this work, we propose a novel RL-based
approach for causal discovery, by incorporating RL into the ordering-based
paradigm. Specifically, we formulate the ordering search problem as a
multi-step Markov decision process, implement the ordering generating process
with an encoder-decoder architecture, and finally use RL to optimize the
proposed model based on the reward mechanisms designed for each ordering. A
generated ordering is then processed using variable selection to obtain
the final causal graph. We analyze the consistency and computational complexity
of the proposed method, and empirically show that a pretrained model can be
exploited to accelerate training. Experimental results on both synthetic and
real data sets show that the proposed method achieves much improved
performance over the existing RL-based method.
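
The last step of the pipeline, turning a variable ordering into a graph by variable selection, can be illustrated with a minimal sketch. It assumes linear relations and uses thresholded least squares as a stand-in for the variable selection step; the function name `graph_from_ordering`, the `threshold` value, and the toy data are hypothetical and are not taken from the proposed method. The RL-based ordering search itself is not shown.

```python
import numpy as np

def graph_from_ordering(X, ordering, threshold=0.3):
    """Build an adjacency matrix A (A[i, j] = 1 means i -> j) from an ordering.

    Each variable is regressed on the variables that precede it in the
    ordering; a predecessor is kept as a parent only if the magnitude of
    its coefficient exceeds `threshold` (an illustrative pruning rule).
    """
    d = X.shape[1]
    A = np.zeros((d, d), dtype=int)
    for pos, j in enumerate(ordering):
        preds = list(ordering[:pos])
        if not preds:
            continue  # the first variable in the ordering has no candidates
        # Center the data so that no intercept term is needed.
        Xp = X[:, preds] - X[:, preds].mean(axis=0)
        y = X[:, j] - X[:, j].mean()
        coef, *_ = np.linalg.lstsq(Xp, y, rcond=None)
        for i, c in zip(preds, coef):
            if abs(c) > threshold:
                A[i, j] = 1
    return A

# Toy linear chain x0 -> x1 -> x2 (assumed data, for illustration only).
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(size=n)
x2 = -1.5 * x1 + rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

A = graph_from_ordering(X, [0, 1, 2])
print(A)
```

Because edges only ever point from earlier to later positions in the ordering, the resulting graph is acyclic by construction, which is why an ordering-based search does not need an explicit acyclicity penalty.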

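This study proposes a novel reinforcement learning (RL) based method for causal discovery that incorporates RL into the ordering-based paradigm, implements the ordering generating process with an encoder-decoder architecture, and finally uses RL to optimize the proposed model; the generated ordering is then processed to obtain the final causal graph. Experimental results on both synthetic and real data sets show that the proposed method performs better than existing RL-based methods.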