“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit

21/12/2024 11 min

Episode Synopsis

I like the research. I mostly trust the results. I dislike the 'Alignment Faking' name and frame, and I'm afraid it will stick and lead to more confusion. This post offers a different frame.

The main way I think about the result is: it's about capability - the model exhibits strategic preference preservation behavior; also, harmlessness generalized better than honesty; and, the model does not have a clear strategy on how to deal with extrapolating conflicting values.

What happened in this frame?

The model was trained on a mixture of values (harmlessness, honesty, helpfulness) and built a surprisingly robust self-representation based on these values. This likely also drew on background knowledge about LLMs, AI, and Anthropic from pre-training.

This seems to mostly count as 'success' relative to actual Anthropic intent, outside of AI safety experiments. Let's call that intent 'Intent_1'.

The model was put [...]

Outline:
(00:45) What happened in this frame?
(03:03) Why did harmlessness generalize further?
(03:41) Alignment mis-generalization
(05:42) Situational awareness
(10:23) Summary

First published: December 20th, 2024
Source: https://www.lesswrong.com/posts/PWHkMac9Xve6LoMJy/alignment-faking-frame-is-somewhat-fake-1
Narrated by TYPE III AUDIO.
