The paper introduces the first formulation of convex Q-learning for Markov decision processes with function approximation. The algorithms and theory rest on a relaxation of a dual of Manne's celebrated linear programming characterization of optimal control. The main contributions firstly concern properties of the relaxation, described as a deterministic convex program: we identify conditions for a bounded solution, and a significant relationship between the solution to the new convex program, and the solution to standard Q-learning. The second set of contributions concern algorithm design and analysis: (i) A direct model-free method for approximating the convex program for Q-learning shares properties with its ideal. In particular, a bounded solution is ensured subject to a simple property of the basis functions; (ii) The proposed algorithms are convergent and new techniques are introduced to obtain the rate of convergence in a mean-square sense; (iii) The approach can be generalized to a range of performance criteria, and it is found that variance can be reduced by considering ``relative'' dynamic programming equations; (iv) The theory is illustrated with an application to a classical inventory control problem.

引入了对带有函数逼近的马尔可夫决策过程进行凸 Q 学习的第一种形式化。该论文主要贡献包括：对该凸松弛性质的属性进行了鉴定，提供了一种近似凸程序的直接模型无关方法，证明了所提出算法的收敛性，并介绍了计算速率。同时，该方法可以推广到多种性能指标，并通过经典库存控制问题进行了实证验证。

随机环境中的凸Q学习：扩展版