"Why Not Just Train For Interpretability?"
Episode Synopsis
Simplicio: Hey I’ve got an alignment research idea to run by you.
Me: … guess we’re doing this again.
Simplicio: Interpretability work on trained nets is hard, right? So instead of that, what if we pick an architecture and/or training objective to produce interpretable nets right from the get-go?
Me: If we had the textbook of the future on hand, then maybe. But in practice, you’re planning to use some particular architecture and/or objective which will not work.
Simplicio: That sounds like an empirical question! We can’t know whether it works until we try it. And I haven’t thought of any reason it would fail.
Me: Ok, let's get concrete here. What architecture and/or objective did you have in mind?
Simplicio: Decision trees! They’re highly interpretable, and my decision theory textbook says they’re fully general in principle. So let's just make a net tree-shaped, and train that! Or, if that's not quite general enough, we train a bunch of tree-shaped nets as “experts” and then mix them somehow.
Me: Turns out we’ve tried that one! It's called a random forest, it was all the rage back in the 2000's.
Simplicio: So we just go back to that?
Me: Alas [...]
---
First published:
November 21st, 2025
Source:
https://www.lesswrong.com/posts/2HbgHwdygH6yeHKKq/why-not-just-train-for-interpretability
---
Narrated by TYPE III AUDIO.
Podcast: LessWrong (30+ Karma)