Prompt-tuning has emerged as a promising method for adapting pre-trained
models to downstream tasks or aligning them with human preferences. Prompt
learning is widely used in NLP but has limited applicability to reinforcement
learning (RL) because RL prompts carry complex physical meaning and
environment-specific information. Capturing these factors requires supervised
learning that imitates the demonstrations, and the prompts may lose their
meaning after learning.
Additionally, directly extending prompt-tuning approaches to RL is challenging
because RL prompts guide agent behavior based on environmental modeling and
analysis, rather than filling in missing information, making it unlikely that
adjustments to the prompt format for downstream tasks, as in NLP, can yield
significant improvements. In this work, we propose the Prompt-Tuning DT
algorithm to address these challenges. It uses trajectory segments as prompts
to guide RL agents in acquiring environmental information and optimizes these
prompts via black-box tuning so that they carry more relevant information,
thereby enabling agents to make better decisions. Our approach perturbs the
elements of the prompt trajectory with noise sampled from a Gaussian
distribution and uses a preference ranking function to find the
optimization direction, thereby providing more informative prompts and guiding
the agent towards specific preferences in the target environment. Extensive
experiments show that with only 0.03% of the parameters learned, Prompt-Tuning
DT achieves comparable or even better performance than full-model fine-tuning
in low-data scenarios. Our work contributes to the advancement of prompt-tuning
approaches in RL, providing a promising direction for optimizing large RL
agents for specific preference tasks.
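
As an illustration of the tuning loop described above, the following minimal sketch perturbs a prompt with Gaussian noise, ranks the perturbed candidates by a black-box score, and moves the prompt along the rank-weighted average of the perturbations. It is a generic rank-based zeroth-order update written under stated assumptions, not the exact procedure of Prompt-Tuning DT. The names `rank_weights`, `tune_prompt`, and `toy_score`, the hyperparameter values, and the quadratic toy objective are all hypothetical; in practice the score would come from evaluating the frozen agent conditioned on the candidate prompt.

```python
import numpy as np


def rank_weights(scores):
    """Map scores to zero-mean weights that depend only on their ranking."""
    n = len(scores)
    ranks = np.argsort(np.argsort(scores))  # 0 = worst candidate, n - 1 = best
    return ranks / (n - 1) - 0.5


def tune_prompt(prompt, score_fn, steps=300, n_samples=16,
                sigma=0.1, lr=0.05, seed=0):
    """Rank-based black-box tuning of a prompt array (illustrative only).

    score_fn is treated as a black box: only the ordering of its outputs
    is used, never its gradient. All hyperparameters are assumptions.
    """
    rng = np.random.default_rng(seed)
    prompt = np.asarray(prompt, dtype=float).copy()
    for _ in range(steps):
        # Draw Gaussian perturbations with the same shape as the prompt.
        noise = rng.standard_normal((n_samples,) + prompt.shape)
        # Score each perturbed prompt with the black-box function.
        scores = np.array([score_fn(prompt + sigma * eps) for eps in noise])
        # Better-ranked perturbations get positive weight, worse ones negative.
        weights = rank_weights(scores)
        direction = np.tensordot(weights, noise, axes=1) / n_samples
        prompt += lr * direction
    return prompt


# Toy stand-in for the agent's return: closeness to a hidden target prompt.
target = np.full((4, 3), 0.5)


def toy_score(p):
    return -float(np.sum((p - target) ** 2))


start = np.zeros((4, 3))  # e.g. 4 time steps, 3 features per step
tuned = tune_prompt(start, toy_score)
```

Because only the ordering of the scores enters the update, any preference signal that can rank candidate prompts could take the place of `toy_score` in this sketch. The model weights are never touched; only the entries of the prompt array are updated.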

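This paper proposes the Prompt-Tuning DT algorithm, which uses trajectory segments as prompts to guide reinforcement learning (RL) agents in acquiring environmental information and optimizes the prompts via black-box tuning, so that they provide more relevant information and steer the agent toward a specific task. In low-data scenarios, learning only 0.03% of the parameters achieves performance comparable to or even better than full-model fine-tuning, offering a promising direction for optimizing large RL agents for specific tasks.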