“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato

21/11/2025 18 min

Listen "“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato"

Descargar episodio Ver en sitio original

Episode Synopsis

Abstract We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) "inoculation prompting", wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned. Twitter thread New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious. In our experiment, we [...] ---Outline:(00:14) Abstract(01:26) Twitter thread(05:23) Blog post(07:13) From shortcuts to sabotage(12:20) Why does reward hacking lead to worse behaviors?(13:21) Mitigations ---
First published:
November 21st, 2025

Source:
https://www.lesswrong.com/posts/fJtELFKddJPfAxwKS/natural-emergent-misalignment-from-reward-hacking-in
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

More episodes of the podcast LessWrong (30+ Karma)

“A Full Epistemic Stack: Knowledge Commons for the 21st Century” by Oliver Sourbut, Ben Goldhaber 20/12/2025

“2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target” by TurnTrout 19/12/2025

“In defence of the human agency: “Curing Cancer” is the new “Think of the Children”” by Rajmohan H 19/12/2025

“Neuro-scaffold” by DirectedEvolution 19/12/2025

“Wuckles!” by Raemon 19/12/2025

“Scalable End-to-End Interpretability” by jsteinhardt 19/12/2025

“Help keep AI under human control: Palisade Research 2026 fundraiser” by Jeffrey Ladish, benwr, Eli Tyre, John Steidley 19/12/2025

“BashArena: A Control Setting for Highly Privileged AI Agents” by james.lucassen, Adam Kaufman 18/12/2025

“Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers” by Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, Owain_Evans 18/12/2025

“A basic case for donating to the Berkeley Genomics Project” by TsviBT 18/12/2025

Ver todos los episodios

ZARZA We are Zarza, the prestigious firm behind major projects in information technology.

“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato

Listen "“Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato"

Episode Synopsis

More episodes of the podcast LessWrong (30+ Karma)

Internet as human right and its scope

Prevent Attacks From Your Local Area Network

Bandwidth: Broadband or Narrowband?

Personnel recruitment via Web

Deep web or Invisible Internet

Subdomains, a glance with the experts!

Free Internet, a prediction in Nostradamus style

Educational Technology: From traditional to digital

Localhost, there’s no place like 127.0.0.1

Googling with breathtaking tricks you ignore

Gray Hat Hacking, those with ambiguous ethics…

Internet Predators on the prowl

Dot COM: The Internet’s dominant TLD