Listen "Ep.10 Are benchmarks broken?"
Episode Synopsis
In this episode, we’re lucky to be joined by Alexandre Sallinen and Tony O’Halloran from the Laboratory for Intelligent Global Health & Humanitarian Response Technologies to discuss how large language models are assessed, including their Massive Open Online Validation & Evaluation (MOOVE) initiative.
0:25 - Technical wrap: what are agents?
13:20 - What are benchmarks?
18:20 - Automated evaluation
20:10 - Benchmarks
37:45 - Human feedback
44:50 - LLM as judge
About the projects we discuss here:
Meditron
Learn about the MOOVE or contact our team if you'd like to be involved
Listen to the LiGHTCAST, including their excellent recent outline of the HealthBench paper
More details in the show notes on our website.
Episodes | Bluesky | [email protected]