Standard model-free reinforcement learning algorithms optimize a policy that
generates the action to be taken in the current time step in order to maximize
expected future return. While flexible, this approach suffers from
inefficient exploration due to its single-step nature. In this work, we present
the Generative Planning Method (GPM), which generates actions not only for the
current step but also for a number of future steps (hence the term generative
planning). This brings several benefits. Firstly, since GPM is trained
by maximizing value, the plans generated from it can be regarded as intentional
action sequences for reaching high value regions. GPM can therefore leverage
its generated multi-step plans for temporally coordinated exploration towards
high value regions, which is potentially more effective than a sequence of
actions generated by perturbing each action at the single-step level, whose
consistent movement decays exponentially with the number of exploration steps.
Secondly, starting from a crude initial plan generator, GPM can refine it to
adapt to the task, which, in turn, benefits future exploration. This is
potentially more effective than the commonly used action-repeat strategy, whose
plans take a fixed, non-adaptive form. Additionally, since the multi-step plan
can be interpreted as the intent of the agent from the current step over a span
of time into the future, it offers a more informative and intuitive signal for
interpretation. Experiments on several benchmark environments demonstrate its
effectiveness compared with several baseline methods.
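
One way to read the claim about exponentially decaying consistent movement is with a toy model that is not taken from the paper: assume a one-dimensional action space in which per-step exploration picks +1 or -1 independently with equal probability. The sketch below computes the chance that all steps of such an exploration sequence move in the same direction, both in closed form and by brute-force enumeration. The function names are hypothetical and chosen only for illustration.

```python
from itertools import product


def prob_consistent_direction(num_steps: int) -> float:
    """Closed form: chance that num_steps independent fair +/-1 actions share one sign.

    Toy assumption (not from the paper): each step is an independent fair coin flip.
    """
    return 2 * 0.5 ** num_steps


def enumerated_prob_consistent(num_steps: int) -> float:
    """Same quantity, obtained by enumerating every +/-1 action sequence."""
    sequences = list(product((-1, 1), repeat=num_steps))
    # A sequence is consistent when every step has the same sign,
    # i.e. the absolute net displacement equals the number of steps.
    consistent = sum(1 for seq in sequences if abs(sum(seq)) == num_steps)
    return consistent / len(sequences)


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, prob_consistent_direction(n))
```

Under this toy assumption the probability halves with every additional step, whereas a plan that commits to one direction for all steps moves consistently by construction.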

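The Generative Planning Method enables more effective value-maximizing policy optimization by generating and refining multi-step actions, which improves exploration efficiency and the adaptivity of the agent's actions.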

Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning

Nearly fifteen years ago, Google unveiled the generalized second price (GSP)
auction. By all theoretical accounts including their own [Varian 14], this was
the wrong auction --- the Vickrey-Clarke-Groves (VCG) auction would have been
the proper choice --- yet GSP has succeeded spectacularly.
We give a deep justification for GSP's success: advertisers' preferences map
to a model we call value maximization; they do not maximize profit, as the
standard theory assumes. For value maximizers, GSP is the truthful
auction [Aggarwal 09]. Moreover, this implies an axiomatization of GSP --- it
is an auction whose prices are truthful for value maximizers --- that can be
applied much more broadly than the simple model for which GSP was originally
designed. In particular, applying it to arbitrary single-parameter domains
recovers the folklore definition of GSP. Through the lens of value
maximization, GSP metamorphoses into a powerful auction, sound in its
principles and elegant in its simplicity.
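
For readers unfamiliar with the mechanism, the following is a minimal sketch of GSP pricing in its textbook position-auction form, not the axiomatized generalization discussed above. It assumes per-click bids, slots ordered from best to worst, no reserve price, and ties broken by bidder index; the function name `gsp_prices` is hypothetical.

```python
from typing import List, Tuple


def gsp_prices(bids: List[float], num_slots: int) -> List[Tuple[int, float]]:
    """Assign slots by descending bid; each winner pays the next-highest bid.

    Returns (bidder_index, price_per_click) pairs, best slot first.
    Assumptions: no reserve price, ties broken by lower bidder index.
    """
    # Python's sort is stable, so equal bids keep their original index order.
    ranked = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    outcome = []
    for slot in range(min(num_slots, len(ranked))):
        winner = ranked[slot]
        # The price is the bid of the bidder ranked immediately below,
        # or zero when no lower-ranked bidder exists.
        if slot + 1 < len(ranked):
            price = bids[ranked[slot + 1]]
        else:
            price = 0.0
        outcome.append((winner, price))
    return outcome


if __name__ == "__main__":
    print(gsp_prices([3.0, 5.0, 1.0], num_slots=2))
```

With bids of 3.0, 5.0, and 1.0 and two slots, the bidder who bid 5.0 takes the top slot at a price of 3.0 per click, and the bidder who bid 3.0 takes the second slot at 1.0.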

The GSP auction succeeded because advertisers are value maximizers, for whom GSP pricing is truthful; this view supports applying GSP more broadly, including to arbitrary single-parameter domains.