“On Emergent Misalignment” by Zvi

28/02/2025 41 min


Episode Synopsis

One hell of a paper dropped this week.
It turns out that if you fine-tune models, especially GPT-4o and Qwen2.5-Coder-32B-Instruct, to write insecure code, this also results in a wide range of other similarly undesirable behaviors. They more or less grow a mustache and become their evil twin.
More precisely, they become antinormative. They do what seems superficially worst. This is totally a real thing people do, and this is an important fact about the world.
The misalignment here is not subtle.

There are even more examples here; the whole thing is wild.
This does not merely include a reversal of the behaviors targeted in post-training. It includes general stereotypical evilness. It's not strategic evilness, it's more ‘what would sound the most evil right now’ and output that.

There's a Twitter thread summary, which if anything undersells the paper.
Ethan Mollick: This [...]
---
Outline:
(01:27) Paper Abstract
(03:22) Funny You Should Ask
(04:58) Isolating the Cause
(08:39) No, You Did Not Expect This
(12:37) Antinormativity is Totally a Thing
(16:15) What Hypotheses Explain the New Persona
(20:59) A Prediction of Correlational Sophistication
(23:27) Good News, Everyone
(31:00) Bad News
(36:26) No One Would Be So Stupid As To
(38:23) Orthogonality
(40:19) The Lighter Side
---
First published:
February 28th, 2025

Source:
https://www.lesswrong.com/posts/7BEcAzxCXenwcjXuE/on-emergent-misalignment
---
Narrated by TYPE III AUDIO.
