Listen "“Tensor-Transformer Variants are Surprisingly Performant” by Logan Riggs"
Episode Synopsis
Audio note: this article contains 48 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

I've been researching tensor networks as a more interpretable architecture, but whenever I tell people this, they always ask "But is it any good?" So I trained multiple 500M-parameter LLMs on FineWeb, showing the tensor variant needed ~4% more batches of data to match CE loss. There are a few caveats, so my personal estimate is around 15% worse to 10% better. Details below.

The Architecture

Replacing MLP w/ a Bilinear Layer

An MLP is a linear encoder, ReLU, then linear decoder: MLP(x) = D(ReLU(E(x)))

A bilinear layer asks "what's better than one encoder? Two!": Bilinear(x) = D(Lx ⊙ Rx)

where ⊙ means "element-wise multiply", e.g. [1, 2, 3] ⊙ [1, 2, 3] = [1, 4, 9].

A SwiGLU layer (Swish Gated Linear Unit) says "let's add in nonlinearities": SwiGLU(x) = D(swish(Lx) ⊙ Rx)

SwiGLU is a SOTA architecture & Bilinear is a tensor network. (A code sketch of these three layers appears after the outline below.)

Replacing Softmax Attn w/ Bilinear Attn

For a tensor network, we are only allowed polynomial nonlinearities. For attention, this means we need to replace softmax w/ [...]

---

Outline:
(00:48) The Architecture
(00:51) Replacing MLP w/ a Bilinear Layer
(01:45) Replacing Softmax Attn w/ Bilinear Attn
(02:24) Experiment & Results
(03:48) Caveats:
(03:52) (1) Softmax attention ran faster cause it has a CUDA kernel
(04:19) (2) Bilinear Attention can run much faster than Softmax Attn
(05:20) (3) Bilinear Attention has more Parameters
(05:52) (4) This was the 2nd-Dumbest Tensor-Attn Variant
(06:19) Replication & Trained Models
(06:31) Future Work
(06:56) Path to Impact
(07:59) Interp w/ Tensor Networks
(10:29) Appendix A: Noam Shazeer's 2020 paper
(11:03) Appendix B: Scaling of Bilinear Attention
(13:46) Appendix C: Bilinear Attention Expressivity
(14:22) Appendix D: But what about Flash Attention?

The original text contained 2 footnotes which were omitted from this narration.
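To make the three layer types above concrete, here is a minimal PyTorch sketch. The module names, dimension arguments, and bias defaults are illustrative assumptions, not the author's actual training code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Standard MLP block: linear encoder E, ReLU, linear decoder D."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.E = nn.Linear(d_model, d_hidden)
        self.D = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # MLP(x) = D(ReLU(E(x)))
        return self.D(F.relu(self.E(x)))

class Bilinear(nn.Module):
    """Bilinear block: two encoders L and R, element-wise product, decoder D."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.L = nn.Linear(d_model, d_hidden)
        self.R = nn.Linear(d_model, d_hidden)
        self.D = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Bilinear(x) = D(Lx ⊙ Rx), where ⊙ is element-wise multiplication
        return self.D(self.L(x) * self.R(x))

class SwiGLU(nn.Module):
    """SwiGLU block: swish (SiLU) gate on one encoder, product with the other."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.L = nn.Linear(d_model, d_hidden)
        self.R = nn.Linear(d_model, d_hidden)
        self.D = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = D(swish(Lx) ⊙ Rx); swish is called SiLU in PyTorch
        return self.D(F.silu(self.L(x)) * self.R(x))
```

Note that Bilinear(x) is a degree-2 polynomial in x, which is what keeps it inside the tensor-network family, while the swish gate in SwiGLU breaks that polynomial structure.

---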
First published:
January 12th, 2026
Source:
https://www.lesswrong.com/posts/hp9bvkiN3RzHgP9cq/tensor-transformer-variants-are-surprisingly-performant
---
Narrated by TYPE III AUDIO.
---

Images from the article: Apple Podcasts and Spotify do not show images in the episode description; try Pocket Casts or another podcast app.