“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit

21/12/2024 11 min

Listen "“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit"

Episode Synopsis

I like the research. I mostly trust the results. I dislike the 'Alignment Faking' name and frame, and I'm afraid it will stick and lead to more confusion. This post offers a different frame. The main way I think about the result is: it's about capability - the model exhibits strategic preference preservation behavior; also, harmlessness generalized better than honesty; and, the model does not have a clear strategy on how to deal with extrapolating conflicting values. What happened in this frame? The model was trained on a mixture of values (harmlessness, honesty, helpfulness) and built a surprisingly robust self-representation based on these values. This likely also drew on background knowledge about LLMs, AI, and Anthropic from pre-training.This seems to mostly count as 'success' relative to actual Anthropic intent, outside of AI safety experiments. Let's call that intent 'Intent_1'.The model was put [...] ---Outline:(00:45) What happened in this frame?(03:03) Why did harmlessness generalize further?(03:41) Alignment mis-generalization(05:42) Situational awareness(10:23) SummaryThe original text contained 1 image which was described by AI. --- First published: December 20th, 2024 Source: https://www.lesswrong.com/posts/PWHkMac9Xve6LoMJy/alignment-faking-frame-is-somewhat-fake-1 --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

More episodes of the podcast LessWrong (Curated & Popular)

"Scientific breakthroughs of the year" by technicalities 17/12/2025

"A high integrity/epistemics political machine?" by Raemon 17/12/2025

"How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)" by Kaj_Sotala 16/12/2025

“My AGI safety research—2025 review, ’26 plans” by Steven Byrnes 15/12/2025

“Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f 14/12/2025

“Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw 13/12/2025

“The funding conversation we left unfinished” by jenn 13/12/2025

“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck 11/12/2025

“Little Echo” by Zvi 09/12/2025

“A Pragmatic Vision for Interpretability” by Neel Nanda 08/12/2025

Ver todos los episodios

ZARZA We are Zarza, the prestigious firm behind major projects in information technology.

“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit

Listen "“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit"

Episode Synopsis

More episodes of the podcast LessWrong (Curated & Popular)

Digital Natives: Children of today, Technologists of Tomorrow

CAPTCHA for human verification!

Bandwidth: Broadband or Narrowband?

Personnel recruitment via Web

Deep web or Invisible Internet

Subdomains, a glance with the experts!

Free Internet, a prediction in Nostradamus style

Educational Technology: From traditional to digital

Localhost, there’s no place like 127.0.0.1

Googling with breathtaking tricks you ignore

Gray Hat Hacking, those with ambiguous ethics…

Internet Predators on the prowl

Dot COM: The Internet’s dominant TLD