“How Can Interpretability Researchers Help AGI Go Well?” by Neel Nanda

01/12/2025 33 min

Episode Synopsis

Executive Summary

Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [[1]], and we are excited for more of the field to embrace pragmatism! In brief, we think that:

It is crucial to have empirical feedback on your ultimate goal via good proxy tasks [[2]].
We do not need near-complete understanding to have significant impact.
We can perform good focused projects by starting with a theory of change, and good exploratory projects by starting with a robustly useful setting.


But that's pretty abstract. So how can interpretability help AGI go well? A few theories of change stand out to us:

Science of Misalignment: If a model takes a harmful action, we want to be able to rigorously determine whether it was “scheming” or just “confused” [[3]].
Empowering Other Areas Of Safety: Interpretability is not a silver bullet that will solve safety by itself, but it can significantly help other areas by unblocking things or addressing weak points where appropriate, e.g. suppressing eval awareness, or interpreting what safety techniques taught a model. [...]

---

Outline:
(00:11) Executive Summary
(02:57) Theories Of Change
(04:25) Science Of Misalignment
(06:59) Empowering Other Areas Of AGI Safety
(07:17) Evaluation Awareness
(07:53) Better Feedback On Safety Research
(08:11) Conceptual Progress On Model Psychology
(08:44) Maintaining Monitorability Of Chain Of Thought
(09:20) Preventing Egregiously Misaligned Actions
(11:20) Directly Helping Align Models
(13:30) Research Areas Directed Towards Theories Of Change
(13:53) Model Biology
(15:48) Helping Direct Model Training
(15:55) Monitoring
(17:10) Research Areas About Robustly Useful Settings
(18:00) Reasoning Model Interpretability
(22:42) Automating Interpretability
(23:51) Basic Science Of AI Psychology
(24:08) Finding Good Proxy Tasks
(25:05) Discovering Unusual Behaviours
(26:12) Data-Centric Interpretability
(27:16) Model Diffing
(28:05) Applied Interpretability
(29:01) Appendix: Motivating Example For Why Reasoning Model Interp Breaks Standard Techniques

The original text contained 16 footnotes which were omitted from this narration.

---
First published:
December 1st, 2025

Source:
https://www.lesswrong.com/posts/MnkeepcGirnJn736j/how-can-interpretability-researchers-help-agi-go-well
---
Narrated by TYPE III AUDIO.
