On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference

07/12/2025 13 min


Episode Synopsis

This paper analyzes the fundamental limitations of Best-of-N (BoN) sampling, proving theoretically that it is suboptimal under a mixture-of-reference-policies model. The authors propose RF-SeqBoN, a sequential approach that improves efficiency by feeding only **high-reward generations** back into the LLM's context, thereby concentrating computation on superior policy candidates. Both the theoretical analysis and extensive empirical results on diverse reasoning benchmarks confirm that RF-SeqBoN achieves a **strictly better performance-to-budget trade-off** than existing test-time compute (TTC) baselines.
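To make the contrast concrete, here is a minimal Python sketch of plain BoN next to a reward-filtered sequential loop in the spirit of the synopsis. It is an illustration under stated assumptions, not the paper's implementation: the `generate` and `reward` callables, the fixed `threshold` filtering rule, and the toy number-guessing task are all hypothetical, and the synopsis does not specify the actual filtering criterion.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: generate(prompt, context) -> candidate text,
# reward(prompt, candidate) -> scalar score. Neither comes from the paper.
Generator = Callable[[str, List[str]], str]
Reward = Callable[[str, str], float]


def best_of_n(prompt: str, generate: Generator, reward: Reward, n: int) -> str:
    """Plain BoN: draw n independent samples and return the highest-reward one."""
    candidates = [generate(prompt, []) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))


def rf_seq_bon(
    prompt: str,
    generate: Generator,
    reward: Reward,
    n: int,
    threshold: float,
) -> Tuple[Optional[str], float, List[str]]:
    """Sequential sampling that keeps only high-reward generations in context.

    The fixed-threshold rule is an assumption made for this sketch.
    """
    context: List[str] = []  # filtered generations shown to later calls
    best: Optional[str] = None
    best_reward = float("-inf")
    for _ in range(n):
        candidate = generate(prompt, context)
        score = reward(prompt, candidate)
        if score > best_reward:
            best, best_reward = candidate, score
        if score >= threshold:
            # Only generations that pass the reward filter are fed back.
            context.append(candidate)
    return best, best_reward, context


def toy_generate(prompt: str, context: List[str]) -> str:
    """Toy stand-in for an LLM: guesses a number, staying near kept guesses."""
    if context:
        center = sum(float(c) for c in context) / len(context)
        return f"{random.gauss(center, 1.0):.3f}"
    return f"{random.uniform(0.0, 10.0):.3f}"


def toy_reward(prompt: str, candidate: str) -> float:
    """Toy reward: closeness to a hidden target value of 7."""
    return -abs(float(candidate) - 7.0)


if __name__ == "__main__":
    random.seed(0)
    print("BoN:", best_of_n("guess", toy_generate, toy_reward, n=8))
    print("Filtered sequential:", rf_seq_bon("guess", toy_generate, toy_reward, n=8, threshold=-1.5)[0])
```

Both functions spend the same budget of `n` generator calls. The only difference is that the sequential loop conditions later samples on earlier ones that cleared the reward filter, which is the mechanism the synopsis credits for the improved trade-off.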

More episodes of the podcast Best AI papers explained