Why Not Just Train For Interpretability?

22/11/2025 8 min

Episode Synopsis

Simplicio: Hey I’ve got an alignment research idea to run by you.

Me: … guess we’re doing this again.

Simplicio: Interpretability work on trained nets is hard, right? So instead of that, what if we pick an architecture and/or training objective to produce interpretable nets right from the get-go?

Me: If we had the textbook of the future on hand, then maybe. But in practice, you’re planning to use some particular architecture and/or objective which will not work.

Simplicio: That sounds like an empirical question! We can’t know whether it works until we try it. And I haven’t thought of any reason it would fail.

Me: Ok, let's get concrete here. What architecture and/or objective did you have in mind?

Simplicio: Decision trees! They’re highly interpretable, and my decision theory textbook says they’re fully general in principle. So let's just make a net tree-shaped, and train that! Or, if that's not quite general enough, we train a bunch of tree-shaped nets as “experts” and then mix them somehow.

Me: Turns out we’ve tried that one! It's called a random forest, it was all the rage back in the 2000's.

Simplicio: So we just go back to that?

Me: Alas [...]

---
First published:
November 21st, 2025

Source:
https://www.lesswrong.com/posts/2HbgHwdygH6yeHKKq/why-not-just-train-for-interpretability
---
Narrated by TYPE III AUDIO.
