Listen "The Hard Truths About AI Agents: Why Benchmarks Lie and Frameworks Fail"
Episode Synopsis
Building AI agents that actually work is harder than the hype suggests — and most people are doing it wrong. In this special "YAAP: Unplugged" episode (a live panel from the AI Tinkerers meetup at the Hugging Face offices in Paris), Yuval sits down with Aymeric Roucher (Project Lead for Agents at Hugging Face) and Niv Granot (Algorithms Group Lead at AI21 Labs) for an unfiltered discussion about the uncomfortable realities of agent development.
Key Topics:
Why current benchmarks are broken: From MMLU's limitations to RAG leaderboards that don't reflect real-world performance
The tool use illusion: Why 95% accuracy on tool calling benchmarks doesn't mean your agent can actually plan
LLM-as-a-judge problems: How evaluation bottlenecks are capping progress compared to verifiable domains like coding
Frameworks: friend or foe? When to ditch LangChain and LlamaIndex, and why minimal implementations often work better
The real agent stack: MCP, sandbox environments, and the four essential components you actually need
Beyond the hype cycle: From embeddings that can't distinguish positive from negative numbers to what comes after agents
From FIFA World Cup benchmarks that expose retrieval failures to the circular dependency problem with LLM judges, this conversation cuts through the marketing noise to reveal what it really takes to build agents that solve real problems — not just impressive demos.
Warning: Contains unpopular opinions about popular frameworks and uncomfortable truths about the current state of AI agent development.
More episodes of the podcast YAAP (Yet Another AI Podcast)
The Judge Model Diaries: Judging the Judges
26/08/2025
RLVR Lets Models Fail Their Way to the Top
12/08/2025
Trailer
19/06/2025