Ep.10 Are benchmarks broken?

22/06/2025 56 min Episode 12



Episode Synopsis

In this episode, we’re lucky to be joined by Alexandre Sallinen and Tony O’Halloran from the Laboratory for Intelligent Global Health & Humanitarian Response Technologies to discuss how large language models are assessed, including their Massive Open Online Validation & Evaluation (MOOVE) initiative.

0:25 - Technical wrap: what are agents?

13:20 - What are benchmarks?

18:20 - Automated evaluation

20:10 - Benchmarks

37:45 - Human feedback

44:50 - LLM as judge

About the projects we discuss here:

Meditron

Learn about MOOVE, or contact our team if you'd like to be involved

Listen to the LiGHTCAST, including their excellent recent outline of the HealthBench paper

More details in the show notes on our website.

More episodes of the podcast Medical Attention

Should we be using LLMs for discharge summarisation? 09/10/2025

In-context: September 4, 2025 04/09/2025

In-context: August 18, 2025 18/08/2025

In-context: July 20, 2025 20/07/2025

In-context: June 9, 2025 10/06/2025

In-context: May 2025 27/05/2025

Ep.9 AI Mythbusting 10/05/2025

Ep.8 Algorithmic Bias 17/01/2025

Ep.7 Informatics Year in Review 17/12/2024

Ep.6 Human-computer interaction in healthcare 17/09/2024

