"On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference"
Episode Synopsis
This paper analyzes the fundamental limitations of Best-of-N (BoN) sampling, proving theoretically that it is suboptimal under a mixture-of-reference-policies model. The authors propose RF-SeqBoN, a sequential approach that improves efficiency by feeding only **high-reward generations** back into the LLM's context, thereby concentrating computation on superior policy candidates. Both the theoretical analysis and extensive empirical results on diverse reasoning benchmarks confirm that RF-SeqBoN achieves a **strictly better performance-to-budget trade-off** than existing test-time compute (TTC) baselines.
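To make the contrast concrete, here is a minimal Python sketch of the two sampling loops as the synopsis describes them. It is an illustration only, not the paper's implementation: the names `generate`, `reward`, `threshold`, `best_of_n`, and `rf_seq_bon` are hypothetical, the fixed reward threshold is an assumed filtering rule, and the toy generator at the bottom stands in for an LLM.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")  # type of a single generation (text in practice)

# Hypothetical signature: generate(prompt, context) -> one candidate.
Generate = Callable[[str, List[T]], T]
Reward = Callable[[T], float]


def best_of_n(generate: Generate, reward: Reward, prompt: str, n: int) -> T:
    """Plain BoN: draw n independent candidates, return the highest-reward one."""
    candidates = [generate(prompt, []) for _ in range(n)]
    return max(candidates, key=reward)


def rf_seq_bon(
    generate: Generate, reward: Reward, prompt: str, n: int, threshold: float
) -> Tuple[T, List[T]]:
    """Sequential sampling with reward filtering (sketch).

    Candidates are drawn one at a time. Only those whose reward reaches
    `threshold` (an assumed rule) are appended to the context that later
    draws can condition on. Returns the best candidate and the final context.
    """
    context: List[T] = []
    best, best_reward = None, float("-inf")
    for _ in range(n):
        candidate = generate(prompt, context)
        r = reward(candidate)
        if r > best_reward:
            best, best_reward = candidate, r
        if r >= threshold:  # filter: low-reward generations never enter the context
            context.append(candidate)
    return best, context


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_generate(prompt: str, context: List[int]) -> int:
        # Toy stand-in for an LLM: builds on the best example in context.
        base = max(context) if context else 0
        return base + rng.randint(-2, 3)

    print(best_of_n(toy_generate, float, "q", 8))
    print(rf_seq_bon(toy_generate, float, "q", 8, threshold=1.0))
```

Both functions spend the same budget of `n` generations; the only difference is that the sequential variant lets later draws condition on earlier high-reward outputs.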