Listen ""Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Zac Hatfield-Dodds"
Episode Synopsis
Neural networks are trained on data, not programmed to follow rules. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don't understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe.

Luckily for those of us trying to understand artificial neural networks, we can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.

This is a linkpost for https://transformer-circuits.pub/2023/monosemantic-features/. The text of this post is based on our blog post, which serves as a linkpost for the full paper; the paper is considerably longer and more detailed.

Source: https://www.lesswrong.com/posts/TDqvQFks6TWutJEKu/towards-monosemanticity-decomposing-language-models-with

Narrated for LessWrong by TYPE III AUDIO.
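To make the recording-and-intervention idea from the synopsis concrete, here is a minimal PyTorch sketch. It uses a toy two-layer MLP (not the transformer models studied in the paper), and the layer sizes and the choice of neuron 3 are arbitrary illustrative assumptions: it records the activation of every neuron in a layer via a forward hook, silences one neuron, and re-runs the same input to measure how the output changes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one MLP block of a network (hypothetical sizes).
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

recorded = {}

def record_hook(module, inputs, output):
    # Record the activation of every neuron in this layer.
    recorded["relu"] = output.detach().clone()

def silence_hook(module, inputs, output):
    # Intervene: silence (zero out) neuron 3 in this layer.
    output = output.clone()
    output[:, 3] = 0.0
    return output

x = torch.randn(1, 8)  # any input we want to test the network on

# 1) Record activations on a normal forward pass.
handle = model[1].register_forward_hook(record_hook)
baseline = model(x)
handle.remove()
print("recorded activations:", recorded["relu"])

# 2) Re-run the same input with the intervention in place.
handle = model[1].register_forward_hook(silence_hook)
ablated = model(x)
handle.remove()

# 3) Compare the network's response with and without the intervention.
print("output change from silencing neuron 3:", (baseline - ablated).norm().item())
```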