We study the theoretical aspects of controllable language models (CLMs) from a bi-objective optimization perspective. Specifically, we cast CLMs as an off-policy reinforcement learning (RL) problem that requires simultaneously maximizing the reward and likelihood objectives. Our contribution has three parts. First, we establish the theoretical foundations of CLMs by presenting a reward upper bound and Pareto improvement/optimality conditions. Second, we analyze the conditions that improve Pareto optimality and those that violate it. Finally, we propose Reward Dropout, a simple yet powerful method that guarantees policy improvement based on a Pareto improvement condition. Our theoretical results are supported by both deductive proofs and empirical results. We evaluated Reward Dropout on five CLM benchmark datasets and found that it significantly improves the performance of CLMs.

A Bi-objective Perspective on Controllable Language Models: Reward Dropout Improves Off-policy Control Performance