Beyond Vibe Testing: Smarter Eval for Agentic AI
Episode Synopsis
In this episode of Inference Time Tactics, Rob, Cooper, and Byron explore Salesforce’s CRMArena-Pro benchmark and what it reveals about the limits of enterprise AI agents. They share why benchmark scores often fail in production, how inference-time tactics like best-of-N can improve reliability, and what NeuroMetric is building to make eval easier—from an ITC Test Engine to a drag-and-drop interface for rapid visualization and experimentation.
We talked about:
Why Salesforce’s CRMArena-Pro benchmark highlights the gap between lab benchmarks and real-world agent reliability.
How leading models perform inconsistently across single-turn and multi-turn enterprise tasks.
Why benchmark scores are weak predictors of operational success in production.
The role of inference-time tactics in reducing variance and improving stability.
NeuroMetric’s new platform: ITC Test Engine and drag-and-drop interface for experimentation.
Challenges in building agentic systems, from database integration to managing multi-prompt complexity.
Why large language models’ stochastic nature conflicts with business demands for reliability.
Latency, cost, and rate limits as major bottlenecks in scaling agentic workflows.
The limits of “vibe testing” and why rigorous evaluation frameworks are essential.
How Google’s Stacks tool speeds up evaluation with LLM-as-judge, and why it still falls short for enterprise needs.
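The best-of-N tactic mentioned above can be sketched minimally: sample several candidate answers, score each with a judge, and keep the highest-scoring one. The `generate` and `score` functions below are hypothetical stand-ins for a real model call and evaluator (e.g., an LLM-as-judge), not anything from the episode.

```python
import random

def generate(prompt: str, seed: int) -> str:
    # Placeholder for a stochastic LLM call; the seed stands in for sampling noise.
    rng = random.Random(seed)
    return f"{prompt} -> answer variant {rng.randint(0, 9)}"

def score(candidate: str) -> float:
    # Placeholder judge; in practice this would be an LLM-as-judge
    # or a task-specific checker.
    return random.Random(candidate).random()

def best_of_n(prompt: str, n: int = 5) -> str:
    # Sample n candidates, then return the one the judge scores highest.
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score)

print(best_of_n("Summarize this support ticket", n=5))
```

Trading extra inference calls for a reduction in output variance is the core idea: each call is stochastic, but the maximum over N samples is far more stable than a single draw.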
Resources Mentioned:
CRMArena-Pro from Salesforce:
https://www.salesforce.com/blog/crmarena-pro/
Connect with NeuroMetric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Guest:
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith