May, 2019
Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards
Wang Chi Cheung
TL;DR
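Studies an agent in a Markov decision process who uses online convex programming algorithms to design non-stationary policies that maximize a global concave reward function of the average vectorial outcome, addressing multi-objective optimization and constrained optimization in Markovian environments.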
Abstract
We consider an agent who is involved in a Markov decision process and receives a vector of outcomes every round. Her objective is to maximize a global concave reward function on the average vectorial outcome. The
→
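To make the objective concrete, the following toy sketch scores a sequence of actions by applying a concave function to the average outcome vector. It is an illustration only, not the paper's algorithm: the two actions, their outcome vectors, and the choice of `min` as the concave reward are all hypothetical, and state transitions are ignored entirely.

```python
# Toy illustration (assumed setup, not taken from the paper):
# each action yields a fixed 2-dimensional outcome vector.
OUTCOME = {"a": (1.0, 0.0), "b": (0.0, 1.0)}


def average_outcome(outcomes):
    """Component-wise mean of a list of equal-length outcome vectors."""
    n = len(outcomes)
    dim = len(outcomes[0])
    return [sum(v[i] for v in outcomes) / n for i in range(dim)]


def global_reward(avg):
    """Hypothetical concave reward: the smallest coordinate of the average."""
    return min(avg)


def run(actions):
    """Global reward obtained by playing the given action sequence."""
    return global_reward(average_outcome([OUTCOME[x] for x in actions]))


always_a = run(["a"] * 10)          # repeats a single action every round
alternating = run(["a", "b"] * 5)   # switches between the two actions

print(always_a, alternating)
```

In this toy setting, repeating one action leaves one coordinate of the average at zero, while alternating balances both coordinates. This is one way to see why a policy that changes over time can be preferable when the reward is a concave function of the average outcome rather than a sum of per-round rewards.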