Policy design in non-stationary Markov Decision Processes (MDPs) is inherently challenging because time-varying transition dynamics and rewards make it difficult for a learner to determine the actions that maximize cumulative future reward. Fortunately, in many practical applications look-ahead predictions are available; in energy systems, for example, these include forecasts of renewable generation and demand. In this paper, we propose an algorithm that incorporates such look-ahead predictions to achieve low regret in non-stationary MDPs. Our theoretical analysis shows that, under certain assumptions, the regret decreases exponentially as the look-ahead window expands. When the predictions are subject to error, the regret does not explode even if the prediction error grows sub-exponentially as a function of the prediction horizon. We validate our approach through simulations, which confirm the efficacy of the algorithm in non-stationary environments.

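This work addresses the challenge of policy design in non-stationary Markov decision processes and proposes a new algorithm that uses look-ahead predictions to reduce regret. The theoretical analysis shows that, under certain assumptions, the regret decreases exponentially as the look-ahead window expands, and that the regret does not blow up when the predictions are subject to error. Our simulations confirm the effectiveness of the algorithm in non-stationary environments.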

Predictive Control and Regret Analysis of Non-Stationary Markov Decision Processes with Look-Ahead Information