Otto's (2001) Wasserstein gradient flow of the exclusive KL divergence functional provides a powerful and mathematically principled perspective for analyzing learning and inference algorithms. In contrast, algorithms for inclusive KL inference, i.e., minimizing $ \mathrm{KL}(\pi \| \mu) $ with respect to $ \mu $ for some target $ \pi $, are rarely analyzed using tools from mathematical analysis. This paper shows that a general-purpose approximate inclusive KL inference paradigm can be constructed using the theory of gradient flows derived from PDE analysis. We find that several existing learning algorithms can be viewed as particular realizations of this paradigm. For example, the sampling algorithms of Arbel et al. (2019) and Korba et al. (2021) can be viewed in a unified manner as inclusive KL inference with approximate gradient estimators. Finally, we provide the theoretical foundation for Wasserstein-Fisher-Rao gradient flows for minimizing the inclusive KL divergence.

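This paper addresses the lack of mathematical analysis tools for inclusive KL inference and proposes a general-purpose approximate inclusive KL inference paradigm based on PDE analysis. From this perspective, several existing learning algorithms can be viewed in a unified manner as special cases of inclusive KL inference. The most important contribution is a theoretical foundation for Wasserstein-Fisher-Rao gradient flows that minimize the inclusive KL divergence.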

Inclusive KL Minimization: A Wasserstein-Fisher-Rao Gradient Flow Perspective