We observe that several existing policy gradient methods (e.g., vanilla
policy gradient, PPO, and A2C) can suffer from excessively large gradients
when the current policy is nearly deterministic, even in very simple
environments, leading to unstable training.
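As a simple illustration of this effect (a standard calculation under an assumed Gaussian parameterization, chosen here only for exposition and not tied to any particular method above): for $\pi_\theta(a \mid s) = \mathcal{N}\!\left(a;\, \mu_\theta(s), \sigma^2\right)$, the score function with respect to the mean is
\[
\nabla_{\mu} \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma^{2}},
\]
so for a typical on-policy sample with $|a - \mu_\theta(s)| \approx \sigma$, the gradient magnitude scales as $1/\sigma$ and diverges as $\sigma \to 0$, i.e., as the policy approaches a deterministic one.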
To address this