Evaluations of Deep Reinforcement Learning (DRL) methods are an integral part
of scientific progress in the field. Beyond designing DRL methods for general
intelligence, designing task-specific methods is becoming increasingly
prominent for real-world applications. In these settings, the standard
evaluation practice involves using a few instances of Markov Decision Processes
(MDPs) to represent the task. However, many tasks induce a large family of MDPs
owing to variations in the underlying environment, particularly in real-world
contexts. For example, in traffic signal control, variations may stem from
intersection geometries and traffic flow levels. The selected MDP instances may
thus inadvertently cause overfitting and lack the statistical power to draw
conclusions about a method's true performance across the family. In this
article, we augment DRL evaluations to consider parameterized families of MDPs.
We show that, compared with evaluating DRL methods on selected MDP instances,
evaluating on the MDP family often yields a substantially different relative
ranking of methods, casting doubt on which methods should be considered
state-of-the-art. We validate this phenomenon on standard control benchmarks
and the real-world application of traffic signal control. At the same time, we
show that accurately evaluating on an MDP family is nontrivial. Overall, this
work identifies new challenges for empirical rigor in reinforcement learning,
especially as the outcomes of DRL trickle into downstream decision-making.
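
The ranking effect described above can be illustrated with a minimal sketch.
Everything in it is an assumption made for illustration: the family is indexed
by a single hypothetical parameter (a normalized traffic flow level), the two
methods "A" and "B" are placeholders, and their score functions are made-up toy
curves rather than results from this work.

```python
from statistics import mean

# Hypothetical MDP family: one MDP per normalized traffic flow level in [0, 1].
FLOW_LEVELS = [i / 10 for i in range(11)]

# A hand-picked subset, standing in for "a few selected MDP instances".
SELECTED_LEVELS = [0.0, 0.1, 0.2]


def score_method_a(flow: float) -> float:
    """Toy score (made up): strong at low flow, degrades quickly as flow grows."""
    return 1.0 - 0.8 * flow


def score_method_b(flow: float) -> float:
    """Toy score (made up): mediocre everywhere, but robust to flow."""
    return 0.7 - 0.1 * flow


METHODS = {"A": score_method_a, "B": score_method_b}


def rank_methods(levels):
    """Average each method's score over the given MDPs and rank best-first."""
    avg = {name: mean(fn(x) for x in levels) for name, fn in METHODS.items()}
    ranking = sorted(avg, key=avg.get, reverse=True)
    return ranking, avg


selected_ranking, selected_avg = rank_methods(SELECTED_LEVELS)
family_ranking, family_avg = rank_methods(FLOW_LEVELS)

print("selected instances:", selected_ranking, selected_avg)
print("whole family:      ", family_ranking, family_avg)
```

With these toy curves, method A leads on the three selected instances while
method B leads once scores are averaged over the full grid of flow levels. A
real evaluation would replace the score functions with trained-policy returns
and would need a defensible way to sample or weight the family.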

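This study examines how to evaluate deep reinforcement learning methods more
accurately for real-world applications and proposes considering parameterized
families of MDPs. The results show that evaluating DRL methods on an MDP
family, rather than on user-selected MDP instances, often yields a different
ranking of methods, which poses new challenges for empirical research in
reinforcement learning.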