We propose the k-Shortest-Path (k-SP) constraint: a novel constraint on the
agent's trajectory that improves the sample efficiency in sparse-reward MDPs.
We show that any optimal policy necessarily satisfies the k-SP constraint.
Notably, the k-SP constraint prevents the policy from exploring state-action
pairs along non-k-SP trajectories (e.g., going back and forth). However, in
practice, excluding state-action pairs may hinder the convergence of RL
algorithms. To overcome this, we propose a novel cost function that penalizes
the policy for violating the k-SP constraint instead of completely excluding
the violating state-action pairs. Our numerical experiment in a tabular RL
setting demonstrates that the k-SP constraint can significantly reduce the
trajectory space of the policy. As a result, our constraint enables more
sample-efficient learning by suppressing redundant exploration and
exploitation. Our experiments on MiniGrid, DeepMind
Lab, Atari, and Fetch show that the proposed method significantly improves
proximal policy optimization (PPO) and outperforms existing novelty-seeking
exploration methods, including count-based exploration, even in continuous
control tasks. This indicates that it improves sample efficiency by preventing
the agent from taking redundant actions.
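
To make the idea of penalizing non-k-SP behavior concrete, the following is a minimal illustrative sketch, not the cost function used in this work. It assumes a deterministic, unweighted grid world in which shortest-path distances can be computed exactly by breadth-first search, ignores rewards, and treats any window of k steps whose endpoints are fewer than k steps apart as a violation. All names (`grid_neighbors`, `count_ksp_violations`, `ksp_cost`) and the `penalty` coefficient are hypothetical.

```python
from collections import deque


def grid_neighbors(width, height):
    """Return a neighbor function for a 4-connected grid (assumed environment)."""
    def neighbors(state):
        x, y = state
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                yield (nx, ny)
    return neighbors


def bfs_distances(neighbors, source):
    """Shortest-path step counts from source to every reachable state."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def count_ksp_violations(trajectory, k, neighbors):
    """Count k-step windows that are not shortest paths between their endpoints."""
    violations = 0
    for start in range(len(trajectory) - k):
        first, last = trajectory[start], trajectory[start + k]
        shortest = bfs_distances(neighbors, first).get(last)
        # The window took k steps; it violates the constraint if fewer suffice.
        if shortest is not None and shortest < k:
            violations += 1
    return violations


def ksp_cost(trajectory, k, neighbors, penalty=0.1):
    """Soft cost: proportional to the number of violating windows (assumed form)."""
    return penalty * count_ksp_violations(trajectory, k, neighbors)


neighbors = grid_neighbors(3, 3)
direct = [(0, 0), (1, 0), (2, 0)]
back_and_forth = [(0, 0), (1, 0), (0, 0), (1, 0), (2, 0)]
print(ksp_cost(direct, 2, neighbors))
print(ksp_cost(back_and_forth, 2, neighbors))
```

In this sketch the direct trajectory incurs no cost, while the back-and-forth trajectory is penalized for each window that returns to its starting state. In tasks where exact distances are unavailable, the shortest-path check would have to be replaced by a learned or approximate estimate, which this sketch does not attempt.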

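We propose the k-SP constraint, a novel constraint that improves sample efficiency in sparse-reward MDPs. In numerical experiments, reducing the policy's trajectory space suppresses redundant exploration and exploitation, improves sample efficiency, and yields results superior to conventional algorithms.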