"How to game the METR plot" by shash42

21/12/2025 12 min


Episode Synopsis

TL;DR: In 2025, we were in the 1-4 hour range, which has only 14 samples in METR's underlying data. The topic of each sample is public, making it easy for a frontier lab to game METR horizon length measurements, sometimes inadvertently. Finally, the “horizon length” under METR's assumptions might be adding little information beyond benchmark accuracy. None of this is to criticize METR: in research, it's hard to be perfect on the first release. But I’m tired of what is being inferred from this plot, pls stop!

14 prompts ruled AI discourse in 2025

The METR horizon length plot was an excellent idea: it proposed measuring the length of tasks models can complete (in terms of estimated human hours needed) instead of accuracy. I'm glad it shifted the community toward caring about long-horizon tasks. They are a better measure of automation impacts and economic outcomes (for example, labor laws are often based on number of hours of work).

However, I think we are overindexing on it far too much. Especially the AI Safety community, which, based on it, makes huge updates in timelines and research priorities. I suspect (from many anecdotes, including roon's) the METR plot has influenced significant investment [...]

Outline:
(01:24) 14 prompts ruled AI discourse in 2025
(04:58) To improve METR horizon length, train on cybersecurity contests
(07:12) HCAST Accuracy alone predicts log-linear trend in METR Horizon Lengths

First published: December 20th, 2025
Source: https://www.lesswrong.com/posts/2RwDgMXo6nh42egoC/how-to-game-the-metr-plot
Narrated by TYPE III AUDIO.

More episodes of the podcast LessWrong (Curated & Popular)

"Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers" by Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, Owain_Evans 20/12/2025

"Scientific breakthroughs of the year" by technicalities 17/12/2025

"A high integrity/epistemics political machine?" by Raemon 17/12/2025

"How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)" by Kaj_Sotala 16/12/2025

“My AGI safety research—2025 review, ’26 plans” by Steven Byrnes 15/12/2025

“Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f 14/12/2025

“Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw 13/12/2025

“The funding conversation we left unfinished” by jenn 13/12/2025

“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck 11/12/2025

“Little Echo” by Zvi 09/12/2025

