In the stochastic contextual low-rank matrix bandit problem, the expected
reward of an action is given by the inner product between the action's feature
matrix and some fixed, but initially unknown $d_1$ by $d_2$ matrix $\Theta^*$
with rank $r \ll \{d_1, d_2\}$, and an agent sequential