Automatic bill payment is an important part of business operations in fintech
companies. In practice, deduction has mainly relied either on attempting the
total amount or on a heuristic search that splits the bill into smaller parts
to deduct as much as possible. This article proposes an end-to-end approach
that automatically learns the optimal deduction paths (ordered sequences of
deduction amounts), which reduces the cost of manual path design and maximizes
the amount successfully deducted. Specifically, in view of the large search
space of the paths and the
extreme sparsity of historical successful deduction records, we propose a deep
hierarchical reinforcement learning approach which abstracts the action into a
two-level hierarchical space: an upper agent that determines the number of
steps of deductions each day and a lower agent that decides the amount of
deduction at each step. In this way, the action space is structured using prior
knowledge and the exploration space is reduced. Moreover, the inherent
incompleteness of business information makes the environment only partially
observable: a successfully deducted amount indicates merely a lower bound on
the available account balance. To address this, we formulate the problem as a
partially observable Markov decision process (POMDP) and employ an
environment correction algorithm based on the characteristics of the business.
In the world's largest electronic payment business, we have verified the
effectiveness of this scheme offline and deployed it online to serve millions
of users.
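
The two-level action structure and the lower-bound observation described above
can be illustrated with a minimal sketch. It is not the authors' method: the
names `BalanceBelief`, `run_day`, and `halving_policy` are hypothetical, the
lower-level policy is a fixed heuristic standing in for a learned agent, and
the sketch assumes that a deduction fails only because the balance is too low
and that the balance does not change between attempts.

```python
import math
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class BalanceBelief:
    """Interval [lower, upper) believed to contain the hidden balance."""
    lower: float = 0.0
    upper: float = math.inf

    def update(self, amount: float, success: bool) -> None:
        if success:
            # Success shows the balance was at least `amount`; both bounds
            # then shift down by the deducted amount.
            self.lower = max(self.lower - amount, 0.0)
            self.upper = self.upper - amount
        else:
            # Assumption: failure means the balance was below `amount`.
            self.upper = min(self.upper, amount)


def halving_policy(remaining: float, belief: BalanceBelief) -> float:
    """Placeholder lower-level policy: ask for half the known upper bound."""
    if math.isfinite(belief.upper):
        return min(remaining, belief.upper / 2)
    return remaining


def run_day(
    bill: float,
    true_balance: float,
    n_steps: int,
    lower_policy: Callable[[float, BalanceBelief], float],
) -> Tuple[float, BalanceBelief]:
    """Upper level fixes `n_steps`; lower level picks the amount per step."""
    belief = BalanceBelief()
    collected = 0.0
    for _ in range(n_steps):
        remaining = bill - collected
        if remaining <= 0:
            break
        amount = lower_policy(remaining, belief)
        success = amount <= true_balance  # hidden from the agent
        if success:
            true_balance -= amount
            collected += amount
        belief.update(amount, success)
    return collected, belief


collected, belief = run_day(100.0, 60.0, 3, halving_policy)
```

With a bill of 100, a hidden balance of 60, and three steps, this sketch
collects 50 and ends with the remaining balance bounded to [0, 25). The agent
never observes the balance itself, only the outcome of each attempt.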

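This paper proposes an automatic bill payment scheme that uses deep hierarchical reinforcement learning to address the huge search space and the scarcity of historical records. By constructing a hierarchical action space and a partially observable decision model, the scheme has been rolled out in the world's largest electronic payment business.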

Automatic Deduction Path Learning via Reinforcement Learning with Environmental Correction

The optimal multicast tree problem in Software-Defined Networking (SDN)
multicast routing is an NP-hard combinatorial optimization problem. Although
existing intelligent SDN solutions based on deep reinforcement learning can
adapt dynamically to complex changes in network link state, they suffer from
redundant branches, large action spaces, and slow agent convergence. This
paper proposes an intelligent SDN multicast routing algorithm based on deep
hierarchical reinforcement learning to address these problems.
First, the multicast tree construction problem is decomposed into two
sub-problems: the fork node selection problem and the construction of the
optimal path from the fork node to the destination node. Second, based on the
information characteristics of SDN global network perception, the multicast
tree state matrix, link bandwidth matrix, link delay matrix, link packet loss
rate matrix, and sub-goal matrix are designed as the state space of intrinsic
and meta controllers. Third, to shrink the excessive action space, our approach
constructs different action spaces at the upper and lower levels. The
meta-controller uses the network nodes as its action space to select the fork
node, and the intrinsic controller uses the adjacent edges of the current node
as its action space, thus implementing four different action selection
strategies in the construction of the multicast tree. Finally, to help the
agent construct the optimal multicast tree faster, we develop alternative
reward strategies that distinguish between single-step node actions and
multi-step actions towards multiple destination nodes.
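
The difference between the two action spaces can be sketched as follows. This
is only an illustration under stated assumptions: the five-node `TOPOLOGY` and
the function names `meta_action_space` and `intrinsic_action_space` are
hypothetical, and the state matrices, reward strategies, and action selection
strategies from the abstract are not modeled.

```python
from typing import Dict, List, Tuple

# Hypothetical undirected topology, stored as an adjacency list.
TOPOLOGY: Dict[int, List[int]] = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3, 4],
    3: [1, 2, 4],
    4: [2, 3],
}


def meta_action_space(topology: Dict[int, List[int]]) -> List[int]:
    """Upper level: any network node can be chosen as the next fork node."""
    return sorted(topology)


def intrinsic_action_space(
    topology: Dict[int, List[int]], current: int
) -> List[Tuple[int, int]]:
    """Lower level: only edges adjacent to the current node can be chosen."""
    return [(current, neighbor) for neighbor in topology[current]]


fork_candidates = meta_action_space(TOPOLOGY)
edge_candidates = intrinsic_action_space(TOPOLOGY, 2)
```

In this sketch the lower-level action space has one entry per adjacent edge, so
its size follows the degree of the current node rather than the total number
of nodes or links in the network.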

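This study proposes an intelligent SDN multicast routing algorithm based on deep hierarchical reinforcement learning to address the problems of existing algorithms, and constructs a state space from network information characteristics together with different action spaces for the two levels. It also develops alternative reward strategies that distinguish single-step node actions from multi-step actions towards multiple destination nodes, so that the agent constructs the optimal multicast tree faster.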