“Training a Reward Hacker Despite Perfect Labels” by ariana_azarbal, vgillioz, TurnTrout

25/08/2025 13 min

Listen "“Training a Reward Hacker Despite Perfect Labels” by ariana_azarbal, vgillioz, TurnTrout"

Episode Synopsis

Summary: Perfectly labeled outcomes in training can still boost reward hacking tendencies in generalization. This can hold even when the train/test sets are drawn from the exact same distribution. We induce this surprising effect via a form of context distillation, which we call re-contextualization: Generate model completions with a hack-encouraging system prompt + neutral user prompt. Filter the completions to remove hacks. Train on these prompt-completion pairs with the system prompt removed. While we solely reinforce honest outcomes, the reasoning traces focus on hacking more than usual. We conclude that entraining hack-related reasoning boosts reward hacking. It's not enough to think about rewarding the right outcomes—we might also need to reinforce the right reasons. Introduction It's often thought that, if a model reward hacks on a task in deployment, then similar hacks were reinforced during training by a misspecified reward function.[1] In METR's report on reward hacking [...] ---Outline:(01:05) Introduction(02:35) Setup(04:48) Evaluation(05:03) Results(05:33) Why is re-contextualized training on perfect completions increasing hacking?(07:44) What happens when you train on purely hack samples?(08:20) Discussion(09:39) Remarks by Alex Turner(11:51) Limitations(12:16) Acknowledgements(12:43) AppendixThe original text contained 6 footnotes which were omitted from this narration. --- First published: August 14th, 2025 Source: https://www.lesswrong.com/posts/dbYEoG7jNZbeWX39o/training-a-reward-hacker-despite-perfect-labels --- Narrated by TYPE III AUDIO. ---Images from the article:

More episodes of the podcast LessWrong (Curated & Popular)

"Scientific breakthroughs of the year" by technicalities 17/12/2025

"A high integrity/epistemics political machine?" by Raemon 17/12/2025

"How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)" by Kaj_Sotala 16/12/2025

“My AGI safety research—2025 review, ’26 plans” by Steven Byrnes 15/12/2025

“Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f 14/12/2025

“Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw 13/12/2025

“The funding conversation we left unfinished” by jenn 13/12/2025

“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck 11/12/2025

“Little Echo” by Zvi 09/12/2025

“A Pragmatic Vision for Interpretability” by Neel Nanda 08/12/2025

Ver todos los episodios

ZARZA We are Zarza, the prestigious firm behind major projects in information technology.

“Training a Reward Hacker Despite Perfect Labels” by ariana_azarbal, vgillioz, TurnTrout

Listen "“Training a Reward Hacker Despite Perfect Labels” by ariana_azarbal, vgillioz, TurnTrout"

Episode Synopsis

More episodes of the podcast LessWrong (Curated & Popular)

7 Advices to Prevent Identity Theft

Positive Attitude, Share your ZARZA Attitude!

Bandwidth: Broadband or Narrowband?

Personnel recruitment via Web

Deep web or Invisible Internet

Subdomains, a glance with the experts!

Free Internet, a prediction in Nostradamus style

Educational Technology: From traditional to digital

Localhost, there’s no place like 127.0.0.1

Googling with breathtaking tricks you ignore

Gray Hat Hacking, those with ambiguous ethics…

Internet Predators on the prowl

Dot COM: The Internet’s dominant TLD