“Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data” by cloud, mle, Owain_Evans

22/07/2025 10 min


Episode Synopsis

Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered)

tl;dr. We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model.

📄Paper, 💻Code, 🐦Twitter

Research done as part of the Anthropic Fellows Program. This article is cross-posted to the Anthropic Alignment Science Blog.

Introduction

Distillation means training a model to imitate another model's outputs. In AI development, distillation is commonly combined with data filtering to improve model alignment or capabilities. In our paper, we uncover a [...]

Outline:
(01:11) Introduction
(03:20) Experiment design
(03:53) Results
(05:03) What explains our results?
(05:07) Did we fail to filter the data?
(06:59) Beyond LLMs: subliminal learning as a general phenomenon
(07:54) Implications for AI safety
(08:42) In summary

First published: July 22nd, 2025
Source: https://www.lesswrong.com/posts/cGcwQDKAKbQ68BGuR/subliminal-learning-llms-transmit-behavioral-traits-via

Narrated by TYPE III AUDIO.

More episodes of the podcast LessWrong (Curated & Popular)

"Scientific breakthroughs of the year" by technicalities 17/12/2025

"A high integrity/epistemics political machine?" by Raemon 17/12/2025

"How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)" by Kaj_Sotala 16/12/2025

“My AGI safety research—2025 review, ’26 plans” by Steven Byrnes 15/12/2025

“Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f 14/12/2025

“Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw 13/12/2025

“The funding conversation we left unfinished” by jenn 13/12/2025

“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck 11/12/2025

“Little Echo” by Zvi 09/12/2025

“A Pragmatic Vision for Interpretability” by Neel Nanda 08/12/2025

View all episodes

