“On Emergent Misalignment” by Zvi
Episode Synopsis
One hell of a paper dropped this week.
It turns out that if you fine-tune models, especially GPT-4o and Qwen2.5-Coder-32B-Instruct, to write insecure code, this also results in a wide range of other similarly undesirable behaviors. They more or less grow a mustache and become their evil twin.
More precisely, they become antinormative. They do what seems superficially worst. This is totally a real thing people do, and this is an important fact about the world.
The misalignment here is not subtle.
There are even more examples here; the whole thing is wild.
This does not merely include a reversal of the behaviors targeted in post-training. It includes general stereotypical evilness. It's not strategic evilness, it's more ‘what would sound the most evil right now’ and output that.
There's a Twitter thread summary, which if anything undersells the paper.
Ethan Mollick: This [...]
---
Outline:
(01:27) Paper Abstract
(03:22) Funny You Should Ask
(04:58) Isolating the Cause
(08:39) No, You Did Not Expect This
(12:37) Antinormativity is Totally a Thing
(16:15) What Hypotheses Explain the New Persona
(20:59) A Prediction of Correlational Sophistication
(23:27) Good News, Everyone
(31:00) Bad News
(36:26) No One Would Be So Stupid As To
(38:23) Orthogonality
(40:19) The Lighter Side
---
First published: February 28th, 2025
Source: https://www.lesswrong.com/posts/7BEcAzxCXenwcjXuE/on-emergent-misalignment
---
Narrated by TYPE III AUDIO.