Cumulative prospect theory (CPT) is known to model human decisions well, a
claim backed by substantial empirical evidence. CPT works by distorting
probabilities and is more general than classic expected utility and
coherent risk measures. We bring this idea to a risk-sensitive reinforcement
learning (RL) setting and design algorithms for both estimation and control.
The RL setting presents two particular challenges when CPT is applied:
estimating the CPT objective requires estimating the entire distribution of
the value function, and one must find a randomized optimal policy. The estimation
scheme that we propose uses the empirical distribution to estimate the
CPT-value of a random variable. We then use this scheme in the inner loop of a
CPT-value optimization procedure that is based on the well-known simulation
optimization idea of simultaneous perturbation stochastic approximation (SPSA).
We provide theoretical convergence guarantees for all the proposed algorithms
and also illustrate the usefulness of CPT-based criteria in a traffic signal
control application.
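
The two ingredients named above can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes the usual CPT form (weighted gains minus weighted losses), evaluates it on the sorted sample, and pairs it with a generic two-sided SPSA gradient estimate. The weight function `tk_weight`, the power utilities, their default parameters (0.61, 0.88, 2.25), and all function names are illustrative assumptions, not taken from the text.

```python
import random


def tk_weight(p, gamma=0.61):
    """Inverse-S probability weighting (assumed form, not from the text)."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)


def gain_utility(x, alpha=0.88):
    """Utility applied to gains; zero for non-positive outcomes (assumed)."""
    return max(x, 0.0) ** alpha


def loss_utility(x, alpha=0.88, lam=2.25):
    """Utility applied to losses; zero for non-negative outcomes (assumed)."""
    return lam * max(-x, 0.0) ** alpha


def cpt_estimate(samples, w_plus=tk_weight, w_minus=tk_weight,
                 u_plus=gain_utility, u_minus=loss_utility):
    """Estimate a CPT-value from i.i.d. samples via the empirical distribution."""
    xs = sorted(samples)
    n = len(xs)
    gains = 0.0
    losses = 0.0
    for i, x in enumerate(xs):  # i = 0 is the smallest sample
        # Gains: weight the empirical probability of exceeding this sample.
        gains += u_plus(x) * (w_plus((n - i) / n) - w_plus((n - i - 1) / n))
        # Losses: weight the empirical probability of falling at or below it.
        losses += u_minus(x) * (w_minus((i + 1) / n) - w_minus(i / n))
    return gains - losses


def spsa_gradient(objective, theta, delta=0.1, rng=random):
    """Two-sided SPSA gradient estimate of `objective` at `theta`."""
    # One random +/-1 perturbation is shared by all coordinates.
    signs = [rng.choice((-1.0, 1.0)) for _ in theta]
    up = [t + delta * s for t, s in zip(theta, signs)]
    down = [t - delta * s for t, s in zip(theta, signs)]
    diff = objective(up) - objective(down)
    return [diff / (2.0 * delta * s) for s in signs]


def spsa_ascent_step(objective, theta, step=0.05, delta=0.1, rng=random):
    """One gradient-ascent update using the SPSA estimate."""
    grad = spsa_gradient(objective, theta, delta, rng)
    return [t + step * g for t, g in zip(theta, grad)]
```

In the setting described here, `objective` would wrap `cpt_estimate` around returns simulated under the policy parameter `theta`, so that each SPSA step needs only two noisy CPT-value estimates regardless of the dimension of `theta`. With identity weights and linear utilities, `cpt_estimate` reduces to the sample mean, which is a convenient sanity check.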

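This study uses cumulative prospect theory (CPT) to apply risk-sensitive reinforcement learning to traffic signal control, and proposes an estimation scheme and an optimization procedure with guaranteed convergence.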