Listen "The Thinking Algorithm Leaderboard: Why No Single Model Wins"
Episode Synopsis
In this episode of Inference Time Tactics, Cooper and Byron break down NeuroMetric's Thinking Algorithm Leaderboard and what it reveals about building production-ready AI agents. They share why prompt engineering with a single model won't cut it for enterprise use cases, explore the impact of inference-time compute strategies, and discuss what they learned from testing 10 models across real CRM tasks—from surprising token inefficiency to catastrophic failures in SQL generation.
We talked about:
Why NeuroMetric built the first leaderboard combining models with inference-time compute strategies.
How Salesforce's CRMArena-Pro reflects real multi-step business tasks better than pure reasoning benchmarks.
The jagged frontier: no single model or technique dominates across all tasks.
Why GPT 20B was surprisingly token inefficient—twice as slow as GPT 120B for similar accuracy.
How GPT-5 nano's conversational style broke SQL generation tasks completely.
Trading accuracy for speed: two-model ensembles versus five, and saving 20+ seconds per task.
Throughput constraints as a hidden bottleneck when scaling to production volumes.
Future directions: LLM-guided search, task clustering, and compression to specialized small models.
Resources Mentioned:
CRMArena-Pro from Salesforce:
https://www.salesforce.com/blog/crmarena-pro/
Thinking Algorithm Leaderboard:
https://leaderboard.neurometric.ai/
Connect with NeuroMetric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Guest:
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith