Deployment efficiency is an important criterion for many real-world
applications of reinforcement learning (RL). Despite the community's increasing
interest, a formal theoretical formulation of the problem is still lacking. In this
paper, we propose such a formulation for deployment-efficient RL (DE-RL) from
an "optimization with constraints" perspective: we are interested in exploring
an MDP and obtaining a near-optimal policy with minimal \emph{deployment
complexity}, where in each deployment the policy can sample a large batch of
data. Using finite-horizon linear MDPs as a concrete structural model, we
reveal the fundamental limit in achieving deployment efficiency by establishing
information-theoretic lower bounds, and provide algorithms that achieve the
optimal deployment efficiency. Moreover, our formulation for DE-RL is flexible
and can serve as a building block for other practically relevant settings; we
give "Safe DE-RL" and "Sample-Efficient DE-RL" as two examples, which may be
worth future investigation.

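Based on the idea of "constrained optimization", this paper proposes a theoretical formulation of the "deployment efficiency" problem in RL. Using finite-horizon linear MDPs as a concrete structural model, it reveals the fundamental limit on deployment efficiency when obtaining a near-optimal policy with minimal "deployment complexity", and provides algorithms that achieve this optimal deployment efficiency. The formulation is also flexible and can serve as a building block for other practically relevant settings; "Safe DE-RL" and "Sample-Efficient DE-RL" are two examples that merit future investigation.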