The most promising recent methods for AI reasoning require applying variants of Reinforcement Learning (RL) either on rolled out trajectories from the model, even for the step-wise rewards, or large quantities of human annotated trajectory data. The reliance on the rolled-out trajector