We propose training fitted Q-iteration with log-loss (FQI-LOG) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-LOG scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving $\textit{small-cost}$ bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-LOG uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
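To make the contrast between the two losses concrete, here is a minimal sketch of the per-sample objectives one regression step of FQI might use. It assumes costs are normalized so that all Q-values lie in $[0, 1]$, and that the regression target for a transition is the cost plus the smallest next-state value. The helper names (`log_loss`, `squared_loss`, `bellman_targets`) and the toy batch are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

def log_loss(pred, target, eps=1e-12):
    """Binary cross-entropy between predictions and soft targets in [0, 1]."""
    # Clip predictions away from 0 and 1 so the logarithms stay finite.
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def squared_loss(pred, target):
    """Squared error, the usual FQI regression loss."""
    return (pred - target) ** 2

def bellman_targets(costs, next_q):
    """Targets c + min_a' Q(s', a'), clipped to [0, 1] (an assumption of this sketch)."""
    return np.clip(costs + next_q.min(axis=1), 0.0, 1.0)

# Hypothetical batch: three transitions, two actions in each next state.
costs = np.array([0.0, 0.0, 1.0])
next_q = np.array([[0.0, 0.5], [0.2, 0.1], [0.0, 0.0]])
targets = bellman_targets(costs, next_q)

# Compare the two losses for a constant prediction of 0.1 on every sample.
preds = np.full_like(targets, 0.1)
print("targets:     ", targets)
print("log-loss:    ", log_loss(preds, targets))
print("squared loss:", squared_loss(preds, targets))
```

Both losses are minimized when the prediction equals the target, but the log-loss penalizes errors near zero-cost targets much more sharply, which is the regime the small-cost bounds concern.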

Switching the Loss Reduces the Cost in Batch Reinforcement Learning