Listen: “Understanding and Controlling LLM Generalization” by Daniel Tan
Episode Synopsis
A distillation of my long-term research agenda and current thinking. I welcome takes on this.

Why study generalization?

I'm interested in studying how LLMs generalise: when presented with multiple policies that achieve similar loss, which ones tend to be learned by default? I claim this is pretty important for AI safety:

- Re: developing safe general intelligence, we will never be able to train an LLM on all the contexts it will see at deployment. To prevent goal misgeneralization, it's necessary to understand how LLMs generalise their training OOD.
- Re: loss-of-control risks specifically, certain important kinds of misalignment (reward hacking, scheming) are difficult to 'select against' at the behavioural level. A fallback would be if LLMs had an innate 'generalization propensity' to learn aligned policies over misaligned ones.

This motivates research into LLM inductive biases, or as I'll call them from here on, 'generalization propensities'. I have two high-level goals:

1. Understanding the complete set of causal factors that drive generalization.
2. Controlling generalization by intervening on these causal factors in a principled way.

Defining "generalization propensity"

To study generalization propensities, we need two things: "Generalization propensity evaluations" (GPEs) [...]

Outline:
(00:18) Why study generalization?
(01:30) Defining generalization propensity
(02:29) Research questions
First published:
November 14th, 2025
Source:
https://www.lesswrong.com/posts/ZSQaT2yxNNZ3eLxRd/understanding-and-controlling-llm-generalization
---
Narrated by TYPE III AUDIO.