Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or by running RL with a 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Would the