Listen "ATTENTION2D and lean attention: Distributed Self-Attention"
Episode Synopsis
We cover two recent innovations from Microsoft that extend ideas from the original **FlashAttention**. FlashAttention is an IO-aware attention algorithm for Transformers designed to address the quadratic time and memory complexity of standard self-attention on long sequences. By using **tiling and recomputation** to minimize slow **High Bandwidth Memory (HBM)** accesses in favor of fast **on-chip SRAM**, FlashAttention achieves significant wall-clock speedups for training models like BERT and GPT-2, enabling them to handle much longer context lengths. Microsoft's new **ATTENTION2D** builds on memory-efficient methods like FlashAttention to optimize **distributed self-attention** across multiple GPUs, parallelizing along two dimensions (Q-DIM and KV-DIM) to overcome the communication bottleneck inherent in prior single-dimension parallel approaches like Ring Attention. Microsoft's other contribution, **Lean Attention**, also appears to propose a high-performance, tiled execution strategy for attention, using shared memory and iterative computation, similar to the IO-aware concepts in the other sources. Two illustrative sketches of these ideas follow the source list below.

Sources:
- The original FlashAttention paper: https://arxiv.org/pdf/2205.14135
- The FlashAttention-2 paper: https://arxiv.org/pdf/2307.08691
- Microsoft's ATTENTION2D: https://arxiv.org/pdf/2503.15758
- Microsoft's Lean Attention: https://www.microsoft.com/en-us/research/wp-content/uploads/2024/05/Lean_Attention___arxiv_version.pdf
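To make the tiling idea concrete, here is a minimal single-head sketch in NumPy. It is not code from any of the papers, just an illustration under simple assumptions (arbitrary block size, no masking, no multi-head batching): the key/value sequence is processed in blocks, and a running max and softmax denominator are carried per query row so the full N x N score matrix is never materialized.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Single-head attention computed block-by-block over the key/value
    sequence, carrying a running max and normalizer (online softmax) so
    the full N x N score matrix is never materialized."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    row_max = np.full(n, -np.inf)   # running max score per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        scores = (Q @ Kb.T) * scale                  # one (n, block) tile of scores
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)       # rescale previous partial results
        p = np.exp(scores - new_max[:, None])        # unnormalized tile probabilities
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Quick check against the naive quadratic formula.
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 32))
K = rng.standard_normal((128, 32))
V = rng.standard_normal((128, 32))
scores = Q @ K.T / np.sqrt(32)
ref = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```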
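And here is a rough sketch of the two-dimensional partitioning that ATTENTION2D describes; the function name and grid shape are hypothetical, not taken from the paper's code. Instead of giving each of P workers only a slice of the query rows (as in a 1-D ring layout, where the key/value shards must circulate among all workers), the P workers are arranged as a pq x pkv grid, and each worker holds one query shard along Q-DIM and one key/value shard along KV-DIM, with partial outputs and softmax statistics combined along the KV dimension.

```python
def grid_layout(seq_len, pq, pkv):
    """Map each worker (i, j) in a pq x pkv grid to its query-row range
    (Q-DIM) and key/value-row range (KV-DIM). Illustrative only."""
    q_shard, kv_shard = seq_len // pq, seq_len // pkv
    layout = {}
    for i in range(pq):          # Q-DIM: which slice of query rows this worker owns
        for j in range(pkv):     # KV-DIM: which slice of key/value rows it owns
            layout[(i, j)] = (
                range(i * q_shard, (i + 1) * q_shard),
                range(j * kv_shard, (j + 1) * kv_shard),
            )
    return layout

# Example: 8 workers as a 4x2 grid over a 1024-token sequence.
for worker, (q_rows, kv_rows) in grid_layout(1024, 4, 2).items():
    print(worker, q_rows, kv_rows)
```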