We introduce learning and planning algorithms for average-reward MDPs,
including 1) the first general proven-convergent off-policy model-free control
algorithm without reference states, 2) the first proven-convergent off-policy
model-free prediction algorithm, and 3) the first off-policy learning algorithm
that converges to the actual value function rather than to the value function
plus an offset. All of our algorithms are based on using the
temporal-difference error rather than the conventional error when updating the
estimate of the average reward. Our proof techniques are a slight
generalization of those by Abounadi, Bertsekas, and Borkar (2001). In
experiments with an Access-Control Queuing Task, we show some of the
difficulties that can arise when using methods that rely on reference states
and argue that our new algorithms can be significantly easier to use.
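
The central idea stated above, updating the average-reward estimate with the temporal-difference error rather than the conventional error, can be illustrated with a minimal tabular sketch. The update rule follows the commonly described Differential Q-learning form. The two-state MDP, the step sizes `alpha` and `eta`, and all identifiers are assumptions made for illustration and do not come from the abstract.

```python
import random

# Hypothetical two-state MDP (an assumption, not from the paper):
# TRANSITIONS[state][action] = (next_state, reward). Deterministic for simplicity.
TRANSITIONS = {
    0: {0: (0, 1.0), 1: (1, 0.0)},
    1: {0: (0, 0.0), 1: (1, 2.0)},
}


def differential_q_learning(steps=50_000, alpha=0.1, eta=1.0, seed=0):
    """Tabular sketch: the reward-rate estimate is driven by the TD error."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
    r_bar = 0.0            # estimate of the optimal average reward
    behavior_total = 0.0   # reward actually collected by the behavior policy
    s = 0
    for _ in range(steps):
        a = rng.choice((0, 1))  # uniformly random behavior policy (off-policy)
        s_next, r = TRANSITIONS[s][a]
        behavior_total += r
        # TD error of the differential (average-reward) value function.
        delta = r - r_bar + max(q[s_next].values()) - q[s][a]
        q[s][a] += alpha * delta
        # Key step: r_bar moves with the TD error. A conventional update would
        # instead use (r - r_bar), which tracks the behavior policy's reward rate.
        r_bar += eta * alpha * delta
        s = s_next
    return q, r_bar, behavior_total / steps


q, r_bar, behavior_rate = differential_q_learning()
print(f"estimated optimal reward rate: {r_bar:.3f}")
print(f"reward rate earned by random behavior: {behavior_rate:.3f}")
```

In this toy problem the best policy stays in state 1 and earns a reward of 2 per step, while the random behavior policy earns much less. The sketch needs no reference state: the single scalar `r_bar` is updated from whichever state-action pair happens to be visited.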

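This work proposes learning and planning algorithms for average-reward MDPs, including the first general proven-convergent model-free control algorithm without reference states, the first proven-convergent off-policy model-free prediction algorithm, and the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of the algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward.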

Learning and Planning in Average-Reward Markov Decision Processes

In this paper, we focus on general-purpose Distributed Stream Data Processing
Systems (DSDPSs), which process unbounded streams of continuous data at scale,
in a distributed manner, in real or near-real time. A fundamental problem in
a DSDPS is the scheduling problem with the objective of minimizing average
end-to-end tuple processing time. A widely used solution is to distribute
workload evenly over machines in the cluster in a round-robin manner, which is
inefficient because it ignores communication delay.
Model-based approaches do not work well either, owing to the high complexity of
the system environment. We aim to develop a novel model-free approach that can
learn to control a DSDPS well from its experience rather than from accurate and
mathematically solvable system models, just as a human learns a skill (such as
cooking, driving, or swimming). Specifically, we, for the first time, propose
to leverage emerging Deep Reinforcement Learning (DRL) to enable model-free
control in DSDPSs, and we present the design, implementation, and evaluation of
a novel and highly effective DRL-based control framework, which minimizes
average end-to-end tuple processing time by jointly learning the system
environment from very limited runtime statistics and making decisions under the
guidance of powerful Deep Neural Networks. To validate and evaluate the
proposed framework, we implemented it on top of a widely used DSDPS, Apache
Storm, and tested it with three representative applications. Extensive
experimental results show that 1) compared to Storm's default scheduler and the
state-of-the-art model-based method, the proposed framework reduces average
tuple processing time by 33.5% and 14.0%, respectively, on average; and 2) the
proposed framework can quickly reach a good scheduling solution during online
learning, which justifies its practicability for online control in DSDPSs.
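
The remark that round-robin placement ignores communication delay can be illustrated with a small sketch. Everything here is a hypothetical assumption rather than the paper's setup: a four-executor chain topology, two machines, and the number of cross-machine edges used as a rough proxy for communication cost. It does not model Apache Storm's actual scheduler or the proposed DRL framework.

```python
from collections import Counter

# Hypothetical chain topology: executor 0 -> 1 -> 2 -> 3 (an assumption).
EDGES = [(0, 1), (1, 2), (2, 3)]
NUM_EXECUTORS = 4
NUM_MACHINES = 2


def round_robin(num_executors, num_machines):
    """Assign executor i to machine i mod num_machines."""
    return [i % num_machines for i in range(num_executors)]


def contiguous(num_executors, num_machines):
    """Assign consecutive executors to the same machine, keeping load even."""
    per_machine = -(-num_executors // num_machines)  # ceiling division
    return [i // per_machine for i in range(num_executors)]


def cross_machine_edges(assignment, edges):
    """Count edges whose endpoints sit on different machines."""
    return sum(assignment[u] != assignment[v] for u, v in edges)


rr = round_robin(NUM_EXECUTORS, NUM_MACHINES)
block = contiguous(NUM_EXECUTORS, NUM_MACHINES)
cross_rr = cross_machine_edges(rr, EDGES)
cross_block = cross_machine_edges(block, EDGES)
print("round-robin:", rr, "cross-machine edges:", cross_rr)
print("contiguous: ", block, "cross-machine edges:", cross_block)
print("loads:", Counter(rr), Counter(block))
```

Both placements balance the load equally, yet in this toy case round-robin sends every edge of the chain across the network while the contiguous placement sends only one. A learned scheduler can, in principle, take such differences into account.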

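This paper proposes a new approach that uses deep reinforcement learning to achieve model-free control of distributed stream data processing systems, and validates through experiments its effectiveness and practicality in terms of tuple processing time.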