In Reinforcement Learning (RL), an instant reward signal is generated for
each action of the agent, such that the agent learns to maximize the cumulative
reward to obtain the optimal policy. However, in many real-world applications,
the instant reward signals are not obtainable by the agent. Instead, the
learner only obtains rewards at the ends of bags, where a bag is defined as a
partial sequence of a complete trajectory. In this situation, the learner
faces the significant difficulty of exploring the unknown instant rewards within
the bags, which cannot be addressed by existing approaches, including
trajectory-based approaches that consider only complete trajectories and ignore
the inner reward distributions. To formally study this situation, we introduce
a novel RL setting termed Reinforcement Learning from Bagged Rewards (RLBR),
where only the bagged rewards of sequences can be obtained. We provide a
theoretical study that establishes the connection between RLBR and standard RL in
Markov Decision Processes (MDPs). To effectively explore the reward
distributions within the bagged rewards, we propose a Transformer-based reward
model, the Reward Bag Transformer (RBT), which uses the self-attention
mechanism for interpreting the contextual nuances and temporal dependencies
within each bag. Extensive experimental analyses demonstrate the superiority of
our method, particularly in its ability to mimic the original MDP's reward
distribution, highlighting its proficiency in contextual understanding and
adaptability to environmental dynamics.

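Summary: The paper proposes a new RL setting called RLBR (Reinforcement Learning from Bagged Rewards), uses a Transformer-based reward model (the Reward Bag Transformer) to explore the reward distributions within bagged rewards, and demonstrates its strong performance in contextual understanding and adaptability to environmental dynamics.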
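
To make the bagged-reward setting concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes that a bagged reward is the sum of the instant rewards inside the bag and that bags are consecutive, non-overlapping, and of fixed size; neither assumption is stated in the abstract. The function names `bag_rewards` and `observed_rewards` are hypothetical.

```python
from typing import List, Tuple


def bag_rewards(instant_rewards: List[float], bag_size: int) -> List[Tuple[int, float]]:
    """Split a trajectory's instant rewards into consecutive bags.

    Returns one (end_index, bagged_reward) pair per bag.
    Assumption: the bagged reward is the sum of the instant rewards in the bag.
    """
    if bag_size <= 0:
        raise ValueError("bag_size must be positive")
    bags = []
    for start in range(0, len(instant_rewards), bag_size):
        chunk = instant_rewards[start:start + bag_size]
        end = start + len(chunk) - 1  # last step covered by this bag
        bags.append((end, sum(chunk)))
    return bags


def observed_rewards(instant_rewards: List[float], bag_size: int) -> List[float]:
    """Reward sequence the learner actually sees: zero everywhere except
    at the last step of each bag, where the bagged reward is revealed."""
    observed = [0.0] * len(instant_rewards)
    for end, total in bag_rewards(instant_rewards, bag_size):
        observed[end] = total
    return observed


if __name__ == "__main__":
    hidden = [1.0, 0.0, 2.0, 1.0, 3.0]  # instant rewards, unknown to the learner
    print(bag_rewards(hidden, bag_size=2))
    print(observed_rewards(hidden, bag_size=2))
```

Under these assumptions the learner observes only the per-bag totals, so the task of a reward model is to redistribute each total across the steps of its bag. With a bag size of 1 the sketch reduces to the standard instant-reward case, and with a bag size equal to the trajectory length it reduces to the trajectory-level feedback case.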