Maximizing long-term rewards is the primary goal in sequential
decision-making problems. Most existing methods assume that side
information is freely available, enabling the learning agent to observe the
state of every feature before making a decision. In real-world problems, however,
collecting useful information is often costly. This implies that,